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Abstract 

£S) | The investigation of directed acyclic graphs (DAGs) encoding the same Markov property, 

that is the same conditional independence relations of multivariate observational distributions, 
| has a long tradition; many algorithms exist for model selection and structure learning in Markov 

equivalence classes. In this paper, we extend the notion of Markov equivalence of DAGs to the 
case of interventional distributions arising from multiple intervention experiments. We show 
■ that under reasonable assumptions on the intervention experiments, interventional Markov 

equivalence defines a finer partitioning of DAGs than observational Markov equivalence and 
hence improves the identifiability of causal models. We give a graph theoretic criterion for two 
DAGs being Markov equivalent under interventions and show that each interventional Markov 
equivalence class can, analogously to the observational case, be uniquely represented by a chain 
graph called interventional essential graph (also known as CPDAG in the observational case). 
These are key insights for deriving a generalization of the Greedy Equivalence Search algorithm 
| aimed at structure learning from interventional data. This new algorithm is evaluated in a 

simulation study. 
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Directed acyclic graphs (or DAGs for short) are commonly used to model causal relationships 
between random variables; in such models, parents of some vertex in the graph are understood 
as "causes", and edges have the meaning of "causal influences". The causal influences between 
random variables imply conditional independence relations among them. However, those indepen- 
dence relations, or the corresponding Markov properties, do not identify the corresponding DAG 
completely, but only up to Markov equivalence. To put it simple, the skeleton of an underly- 
^ | ing DAG is completely determined by its Markov property, whereas the direction of the arrows 
(which is crucial for causal interpretation) is in general not encoded in the Markov property for 
the observational distribution. 

Interventions can help to overcome those limitations in identifiability. An intervention is re- 
alized by forcing the value of one or several random variables of the system to chosen values, 
destroying their original causal dependencies. The ensemble of both the observational and inter- 
ventional distributions can greatly improve the identifiability of the causal structure of the system, 
the underlying DAG. 

This paper has two main contributions. The first one is an algorithmically tractable graphical 
representation of Markov equivalence classes under a given set of interventions (possibly affecting 
several variables) from which the identifiability of causal models can be read off. This is of general 
interest for computation and algorithms dealing with structure (DAG) learning from an ensemble of 
observational and interventional data such as MC MC. The second cont ribution is a generalization of 
the Greedy Equivalence Search (GES) algorithm of Chickeringl ( 2002bl ). yielding an algorithm called 



Greedy Interventional Equivalence Search (GIES) which can be used for regularized maximum 
likelihood estimation in such an interventional setting. 
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In Section [2j we establish a criterion for two DAGs being Markov equivalent under a given 
intervention setting. We then generalize the concept of essential graphs, a graph theoretic repre- 
sentation of Markov equivalence classes, to the interventional case and characterize the properties 
of those graphs in Section In Section HI we elaborate a set of algorithmic operations to effi- 
ciently traverse the search space of interventional essential graphs and finally present the GIES 
algorithm. An experimental evaluation thereof is given in Section We postpone all proofs to 
Appendix [Bj while Appendix [A] contains a review on graph theoretic concepts and definitions. 
An im plementation o f the G IES algorithm will be available in the next release of the R package 



pcalg (jKalisch et all [2012J); meanwhile, a prerelease version is available upon request from the 



first author. 



1.1 Related Work 



The investigation of Markov equivalence classes of directed graphical models ha s a long tradition, 
perha ps starting with the criterion for two DAGs being Markov equivalent by Verma and Pearl 
i|l99d ) and culminating in the graph theoretic characterization of essential graphs (also called 
CP DAGs, "complet e d par tially directed acyclic graphs") representing Markov equivalence classes 
by lAndersson et al.l (|19971 ) . Several algorithms for estim ating essential graphs from observational 
data exist, such a s the PC algorithm jSpirtes et al. . 2000l ) or the Greedy Equivalence Search (GES) 
algori thm i 
(2005) and 



Murphvi (|200lh . 



Meekl. Il997l: IChickerind . l2002bh : a more complete overview is given in iBrown et al 



Different approaches to incorporate interventio nal data for learning ca us al models have been 
develo ped in the past. The Bayesian procedures of lCooper and Yool (|l999h or lEaton and Murphvi 
(|2007l ) address the problem of calculating a posterior (and also a likelihood) of an ensemble of 
observational and interventional data but do not address questions of identifiability or Markov 
equivalence: allowing different posteriors for Markov equivalent models can be intended in Bayesian 
methods (and realized by giving the corresponding model s different priors ). Since the number of 
DAGs with p variables grows super-exponentially with p ( Robinson . 19731 ). the computation of a 
full posterior is intractable. For this reason, the mentioned Bayesian approaches are limited to 
computing posterior probabilities for certain features of a DAG; such a feature could be an edge 
from a vertex a to another vertex b , or a directed path from a to b visiting additional vertices . 
Approaches based on active learning ( He and Geng . 2008 ; Tong and Koller . 2001 ; Eberhardt . 20081 ) 
propose an iterative line of action, estimating the essential graph with observational data in a 
first step and using interventional data in a second step to orient beforehand unorientable edges. 
He and Gene ( 20081 ) present a greedy procedure in which interventional data is uniquely used 
for deciding about edge orientations; this is not favorable from a statistical point of view since 
interventional data can also help to improve the e s timat ion of the skeleton (or, more generally, 



the observational essential graph). iTong and Kollerl (|200ll ) avoid this prob lem by using a Bayesian 



fram ework, but do no t address the issue of Markov equivalence therewith. lEberhardt et al 



tyesian 

EqqI) 



and Eberhardt ( 20081 ) provide algorithms for choosing intervention targets that completely identify 
all causal models of p variables uniformly, but neither address the question of partial identifiability 
under a lim ited number of in t erven tions nor provide an algorithm for learning the causal structure 
from data. Eberhardt et al. ( 20ld ) present an algorithm for learning cyclic linear causal models, 
but focus on complete identifiability; identifiability results for cyclic models only imply sufficient, 
but not necessary, conditions for the identifiability of acyclic models. 

Probably the most advanced result concerning identifiability of caus al models under single- 
variable interventions so far is given in the work of iTian and Pearll (|200ll ). Although they do not 
provide a characterization of equivalence classes as a whole (as this paper does), they present a 
necessary and sufficient graph theoretic criterion for two models being indistinguishable under a 
set of single- variable interventions as well as a learning algorithm based on the detection of changes 
in marginal distributions. 



2 



2 Model 



We consider p random variables (Xi, . . . ,X P ) =: X which take values in some product measure 
space (X, A, fi) = (IIf=i &l=\ Aii ®f=i w hh Xi C R V i. Each cr-algebra Ai is assumed to 
contain at least two disjoint sets of positive measure to avoid pathologies, and X is assumed to 
have a strictly positive joint density w.r.t. the measure \i on X. We denote the set of all positive 
densities on X by M.. For any subset of component indices A C [p] := {1, . . . ,p}, we use the 
notation Xa '■= FLeA^' -^a '■= (X a ) a eA and the convention Xq = 0. Lowercase symbols like xa 
represent a value in Xa- 

The model we are considering is built upon Markov properties with respect to DAGs. By 
convention, all graphs appearing in the paper shall have the vertex set [p], representing the p 
random variables X\, . . . ,X p . Our notation and definitions related to graphs are summarized in 
Appendix lA.li 



2.1 Causal Calculus: A Short Review 



We start by summarizing important facts and fixing our notation concerning Markov properties 
and intervention calculus. 



Definition 1 (Markov property: lLauritzen . 19961 ) Let D be a DAG. Then we say that a 
probability density / € M. obeys the Markov property of D if f(x) = Yii=i f( x i\ x p& D (i))- The 
set of all positive densities obeying the Markov property of D is denoted by M(D). 

Definition [T] is the most straightforward translation of independe nce relations induced from 
structural equations, the historical origin of directed graphical models (|WrightJ . ll92ll ). Related no- 



tions like local and global Markov pr operties ex i st and are equivalent to the factorization property 
of Definition [1] for positive densities ( Lauritzen . 19961 ). 



Definition 2 (Markov equivalence; lAndersson et all Il997l ) Let D\ and Di be two DAGs. 
D\ and D2 are called Markov equivalent (notation: D\ ~ D2) if M{D\) = A4(Z?2)- 



Theorem 3 ( Verma and Pearl . 1990l ) Two DAGs D\ and D2 are Markov-equivalent if and 
only if they have the same skeleton and the same v-structures. 

Directed graphical models allow for an obvious causal interpretation. For a density / that 
obeys the Markov properties of some DAG D, we can think of a random variable X a being the 
direct cause of another variable X^, if a is a parent of b in D. 



Definition 4 (Causal model) A causal model is a pair (D,f), where D is a DAG on the 
vertex set [p] and / G A4(D) is a density obeying the Markov property of D: D is called the 
causal structure of the model, and / the observational density. 



Causality is strongly linked to interventions. We consider stochastic interventions (jKorb et al 



2004) modeling the effect of setting or forcing one or several random variables Xj, where / C [p] 



is called the intervention target, to the value of independent random variables Ui, called inter- 
vention variables. The joint pr oduct densit y of Ui on Xj, called level density, is denoted by 



/. Extending the doQ operator (jPearll . Il995l ) to stochastic interventions, we denote the density 
of X under such an intervention by /(x|dou(A/ = Ui)). Using truncated factorization and the 
assumption of independent intervention variables, this interventional density can be written as 

fix I do D (A 7 = Ui)) = [J f(xi\x p , D{{) ) H f(xi) . (1) 

By denoting with 7 = and using the convention f(x\ do{X$ = U^)) = f(x), we also encompass 
the observational case as an intervention target. 
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Definition 5 (Intervention graph) Let D = ([p], E) be a DAG with vertex set [p] and edge set 
E (see Appendix lA.lj) . and / C [p] an intervention target. The intervention graph of D is the 

DAG flW = (]p],E(% where := {(a, b) | (a, b) e E,b I}. 

For a causal model (D, /), an interventional density /(-| do£)(X/ = Uj)) obeys the Markov property 
of D^ 1 ': the Markov property of the observational density is inherited. Figure [1] shows an example 
of a DAG and two corresponding intervention graphs. 

As foreshadowed in the introduction, we are interested in causal inference based on data sets 
originating from multiple interventions, that means from a set of the form S = {(Ij, /j)}/=i, where 
Ij C \p] is an intervention target and fj a level density on Xj j for 1 < j < J. We call such a set 
an intervention setting, and the corresponding (multi)set of intervention targets I = {Ij}j =1 
a family of targets. We often use the family of targets as an index set, for example to write a 
corresponding intervention setting as S = {(/, /j)}j 6 x- 

We consider interventional data of sample size n produced by a causal model (D,f) under 
an intervention setting S = {(I, /j)}/ez- We assume that the n samples X^ l \ . . . ,X^ are inde- 
pendent, and write them as usual as rows of a data matrix X. However, they are not identically 
distributed as they arise from different interventions. The interventional data set is fully specified 
by the pair (T, X), 

/ T« \ / -IW - \ 

T= i X= : (2) 

\ / V -iW — J 

where for each i G [n], TW denotes the intervention target under which the sample was 
produced. This data set can potentially contain observational data as well, namely if 6 X. To 
summarize, we consider the statistical model 

lW,I (2) r • • ,X {n) independent, 

X« ~ /( • I do D (X% = U Tli) )), U Tli) ~ f T( i), i = l,...,n, (3) 
and we assume that each target / G I appears at least once in the sequence T ■ 

2.2 Interventional Markov Equivalence: New Concepts and Results 

An intervention at some target a € [p] destroys the original causal influence of other variables 
of the system on X a . Interventional data thereof can hence not be used to determine the causal 
parents of X a in the (undisturbed) system. To be able to estimate at least the complete skeleton of 
a causal structure (as in the observational case), an intervention experiment has to be performed 
based on a conservative family of targets: 

Definition 6 (Conservative family of targets) A family of targets I is called conservative 

if for all a € [p], there is some I £ I such that a £ I. 

In this paper, we restrict our considerations to conservative families of targets; see Section [2. 31 for 
a more detailed discussion. Note that every experiment in which we also measure observational 
data corresponds to a conservative family of targets. 



1* 2 >3 >4 1* 2 >3 4 1< 2 3 >4 




5 >6 7 5 >6 7 5 >6 7 

(a) D ( b ) ,D<{ 4 » (c) Z?« 3 ' 5 » 



Figure 1: A DAG D and the corresponding intervention graphs D^ 4 }) and D^ 3 ' 5 ^ . 
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If a family of targets X contains more than one target, interventional data as in Equation (|3|) 
are not identically distributed. Whereas the distribution of observational data is determined by a 
single density, we need tuples of densities as in the following definition to specify the distribution 
of interventional data. 

Definition 7 Let D be a DAG on [p], and let X be a family of targets. Then we define 

Mz(D) :={(f {I) )iel G M m | V / G X : / (/) G M(D^), and 

V /, J G X, Va^/UJ: / (/) (^ a |x pao(a) ) = / (J) (^|x paD ( a) )} . 

Although the do() operator does not appear in Definition the elements in A4x(D) are exactly 
the tuples (/(-| do£>(A/ = C//)))/gx that can be realized as interventional densities of some causal 
model (D,f). The first condition in the definition reflects the fact that an intervention at a 
target I generates a density obeying the Markov property of D^ 1 '; the second condition is a 
consequence of the truncated factorization in Equation JT]). These considerations are formalized 
in the following lemma and motivate Definition [9] of interventional Markov equivalence in analogy 
to the observational case. Note that for X = {0}, Definition [7J equals its observational counterpart: 
M m (D) = M(D) (see Definition HJ . 

Lemma 8 Let D be a DAG on \p], andX a conservative family of targets. 

(i) Let (D,f) be a causal model (that is, f G M(D)), S = {(/, //)}/ex an intervention setting 
and Uj ~ // intervention variables for L G X. Then, we have 

(/(■ | do(X I = U I ))) IeI eM I (D) . 

(ii) Let (/ (/) )/ e i G M X (D). Then there is some positive density f G M(D) and an intervention 
setting S = {(I, //)}/eZ such that /(-|do(X/ = {//)) = f^\-) for random variables Ui with 
density fj, for all L G X. 

Definition 9 (Interventional Markov equivalence) Let D\ and D 2 be DAGs, and X a family 
of targets. D\ and D2 are called X-Markov equivalent (notation: D\ ~j D2) if MxiPi) = 
M.t(X)?)- The X-Markov equivalence class of a DAG D is denoted by [D]z- 

Alternatively, we will also use the term "interventionally Markov equivalent" when it is clear which 
family of targets is meant. For the simplest conservative family of targets, X = {0}, we get back 
Definition [2] for the observational case. We now generalize Theorem [3] for the interventional case in 
order to get a purely graph theoretic criterion for interventional Markov equivalence of two given 
DAGs, the main result of this section. 

Theorem 10 Let D\ and D2 be two DAGs on [p], and X a conservative family of targets. Then, 
the following statements are equivalent: 

(i) Di ~ z D 2 ; 

(ii) for all L G X, D^p ~ D^' (in the observational sense); 

(Hi) for all L G X, and have the same skeleton and the same v-structures; 
(iv) D\ and D2 have the same skeleton and the same v-structures, and Z)j and have the 
same skeleton for all L G X. 

2.3 Discussion 

Throughout this paper, we always assume the observational density / of a causal model to be 
strictly positive. This assumption makes sure that the conditional densities in Equation (pQ) are 
well-defined. The requirement of a strictly positive density can, however, be a restriction for 
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example for discrete models (where the density is with respect to the counting measure). In the 
observational case, the notion of Markov e quiva lence remains the same when we also allow densities 
that are not strictly positive (Lauritzen, 19961 ). We conjecture that the notion of interventional 



Markov equivalence (Definition and Theorem HO]) also remains valid for such densities; corre- 
sponding proofs would, however, require more caution to avoid the aforementioned problems with 
(truncated) factorization. 

To illustrate the importance of a conservative family of targets for structure identification, let 
us consider the simplest non-trivial example of a causal model with 2 variables X\ and Xi. Under 
observational data, we can distinguish two Markov equivalence classes: one in which the variables 
are independent (represented by the empty DAG -Do)) an d one in which they are not independent 
(represented by the DAGs D% := 1 — > 2 and D2 := 1 < — 2). D\ and D2 can be distinguished if we 
can measure data from an intervention at one of the vertices in addition to observational data; this 
experimental setting corresponds to the (conservative) family of targets X = {0, {1}}. However, an 
intervention at, say, X\ alone (that is, in the absence of observational data), corresponding to the 
non-conservative family X = {{1}}, only allows a distinction between the models D2 and Dq on 
the one hand (which do not show dependence between X\ and X2 under the intervention) and D\ 
on the other hand (which does show dependence between X± and X2 under the intervention). Note 
that the two indistinguishable models -Do an d D2 do not even have the same skeleton, and that it 
is impossible to determine the influence of X2 on X\ in the undisturbed system. In this setting, it 
would be more natural to consider the intervened variable X\ as an external parameter rather than 
a random variable of the system, and to perform regression to detect or determine the influence 
of X\ on X2. Note, however, that full identifiability of the models does not require observational 
data; interventions at X\ and X2 (corresponding to the conservative family X = {{1}, {2}} in our 
notation) are also sufficient. 

Theorem [10] is of great importance for the description of Markov equivalence classes under 
interventions. It shows that two DAGs which are interventionally Markov equivalent under some 
conservative family of targets are also observationally Markov equivalent: 

Di ~x D 2 ^D l ~ D 2 . (4) 

This implication is not true anymore for non-conservative families of targets. This is an explanation 
for the term "conservative" : a conservative family of targets yields a finer partitioning of DAGs into 
equivalence classes compared to observational Markov equivalence, but it preserves the "borders" 
of observational Markov equivalence classes. Figure [2] shows three DAGs that are observationally 
Markov equivalent, but which fall into two different interventional Markov equivalence classes 
under the family of targets X = {0, {4}}. 



Theorem 1101 agrees with Theorem 3 of iTian and Pearll (|200ll ) for single- variable interventions. 



While we also make a statement about interventions at several variables, they prove their theorem 
for perturbations of the system at single variables only, but for a wider class of perturbations called 
mechanism changes that go beyond our notion of interventions. While an intervention destroys 
the causal dependence of a variable from its parents (and hence replaces a conditional density by 
a marginal one in the Markov factorizatio n, see Equation (fTl)), a m echanism change (also known 



as "imperfect" or "soft" interventions; see lEator, and M.ro^ B alters the functional form of 



this dependence (and hence replaces a Markov factor by a different one which is still a conditional 
distribution). The fact that Theorem[TU]is true for mechanism changes on single variables motivates 
the conjecture that it also holds for mechanism changes on several variables. 



3 Essential Graphs 

Theorem [10] represents a computationally fast criterion for deciding whether two DAGs are in- 
terventionally Markov equivalent or not. However, given some DAG D, it does not provide a 
possibility for quickly finding all equivalent ones, and hence does not specify the equivalence class 
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5 ► 6 7 

(a) D 



5 ► 6 7 

(b) Di 



5 ► 6 7 

(c) D 2 



Figure 2: Three DAGs having equal skeletons and a single v-structure, 3 — > 6 < — 5, hence being 
observationally Markov equivalent. For X = {0, {4}}, we have D ~i D\, but D 9^2 D% since the 
skeletons of £)({ 4 1) (Figure 1(b)) and do not coincide. 



as a whole. In this section, we give a characterization of graphs that uniquely represent an in- 
terventional Markov equivalence class (Theorem I18p . Our charact erization of these interv entional 
essential graphs is inspired by and similar to the one developed bv lAndersson et al.l ( 19971 ) for the 
observational case and allows for handling equivalence classes algorithmically. Furthermore, we 
present a linear time algorithm for constructing a representative of the equivalence class corre- 
sponding to an interventional essential graph (Proposition [TU] and discussion thereafter), as well 
as a polynomial time algorithm for constructing the interventional essential graph of a given DAG 
(Algorithm [TJ . Throughout this section, X always stands for a conservative family of targets. 



3.1 Definitions and Motivation 

All DAGs in an X-Markov equivalence class share the same skeleton; however, arrow orientations 
may vary between different representatives (Theorem [10]). Varying and common arrow orientations 
are represented by undirected and directed edges, respectively, in X-essential graphs. 



Definition 11 (X-essential graph) Let D be a DAG. The X-essential graph of D is defined 
as £x(D) := Uzj'eLDV D 1 . (The union is meant in the graph theoretic sense, see Appendix lA.ip . 

When the family of targets X in question is clear from the context, we will also use the term 
interventional essential graph, w hile "observationa l essen tial graph" shall refer to the concept 
of essential graphs as introduced by lAndersson et al.l (|1997l ) in the observational case. Simply 



speaking of "essential graphs", we mean interventional or observational essential graphs in the 
following. 



Definition 12 (X-essential arrow) Let D be a DAG. An edge a — » b G D is X-essential in D 

if a — » b G D' V D' G [Z%. 

An X-essential graph typically contains directed as well as undirected edges. Directed ones cor- 
respond to arrows that are X-essential in every representative of the equivalence class; in other 
words, X-essential arrows are those whose direction is identifiable. A first sufficient criterion for 
an edge to be X-essential follows immediately from Lemma I4T1 (Appendix IB. 1|) . 



Corollary 13 Let D be a DAG with a — >6 G D. If there is an intervention target I G X such that 
\{a, b} Pi I\ = 1, then a — > b is X-essential. 



T he investiga t ion of e ssential graphs has a long tradition in the observational case (jAndersson et al 



19971 : IChickerineL l2002al ^ . Due to increased identifiability of causal structures, Markov equivalence 



classes shrink in the interventional case; Equation Q implies £x(D) C £^(D) for any conserva- 
tive family of targets X (see also Figure [8] in Section [5]). Essential graphs, interventional as well as 
observational ones, are mainly interesting because of two reasons: 



• It is important to know which arrow directions of a causal model are identifiable and which 
are not since arrow directions are relevant for the causal interpretation. 



1 2 3 > 4 




Figure 3: A graph with six arrows. Four of them are strongly X-protected for any conservative 
family of targets X (in parentheses: arrow configurations according to Definition I14p . Arrows 3 — >4 
and 4 — > 7 are strongly X-protected for X = {0, {4}}, but not for X = {0}. 



• Markov equivalent DAGs encode the same statistical model. Hence the space of DAGs 
is no suitable "parameter" or search space for statistical inference and computation. The 
natural search space is given by the set of the equivalence classes, the objects that can be 
distinguished from data. Essential graphs uniquely represent these equivalence classes and 
are efficiently manageable in algorithms. 

The characterization of X-essential graphs (Theorem [TBj) relies on the noti on of strongly X- 



protected arrows (Definition 1 14|) which reproduces the corresponding definition of lAndersson et al. 



(|1997l ) for X = {0}; an illustration is given in Figure [3] 



Definition 14 (Strong protection) Let G be a graph. An arrow a — > b £ G is strongly X- 
protected in G if there is some I £ I such that \I n {a, b}\ = 1, or the arrow a — > b occurs in at 
least one of the following four configurations as an induced subgraph of G: 

ci 

(a): a > b (b): a > b (c): a > b (d): a ► b 



c c C C2 

We will see in Theorem 1181 that every arrow of an X-essential graph (that is, every edge cor- 
responding to an X-essential arrow in the representative DAGs) is strongly X-protected. The 
configurations in Definition [Tl] guarantee the identifiability of the edge orientation between a and 
b: if there is a target I £ X such that | J D {a, b}\ = 1, turning the arrow would change the skeleton 
of the intervention graph (see also Corollary 1 13|) : in configuration (a), reversal would create a 
new v-structure; in (b), reversal would destroy a v-structure; in (c), reversal would create a cycle; 
an in (d) finally, at least one of the arrows between a and c\ or C2 must point aw ay from a in each 



repres entative, hence turning the arrow a — > b would create a cycle. We refer to Andersson et al. 



([19970 for a more detailed discussion of the configurations (a) to (d). 



3.2 Characterization of Interventional Essential Graphs 

As in the observational setting, we can show that interventional essential graphs are chain graphs 
with chordal chain components (see Appendix lA.ljl . For the obs ervational case X = |0| , Proposi- 



tions [15] and [16] below correspond to Propositions 4.1 and 4.2 of lAndersson et al.l ([19970 . 

Proposition 15 Let D be a DAG on \p\. Then: 
(i) £i(D) is a chain graph. 

(ii) For each chain component T £ T(£x(D)), the induced subgraph £x(D)[T] is chordal. 

Proposition 16 Let D be a DAG. A digraph D' is acyclic and X-equivalent to D if and only if 
D' can be constructed by orienting the edges of every chain component of £i(D) according to a 
perfect elimination ordering. 

This proposition is not only of theoretic, but also of algorithmic interest. According to the ex- 
planation in Appendix IA.21 perfect elimination orderings on the (chordal) chain components of 



8 



£x(D) can be generated with LexBFS (Algorithm [6j) ; doing this for all chain components yields 
computational complexity 0(|-E| + p), where E denotes the edge set of £x{D) (see Appendix IA,2|) . 

As an immediate consequence of Proposition [TBI interventional essential graphs are in one-to- 
one correspondence with interventional Markov equivalence classes. We will therefore also speak 
about "representatives of X-essential graphs" , where we mean representatives (that is, DAGs) of 
the corresponding equivalence class. Propositions 1151 and 1161 give the justification for the following 
definition; note that in order to generate a representative of some X-essential graph, the family of 
targets X need not be known. 

Definition 17 Let G be the X-essential graph of some DAG. The set of representatives of G is 
denoted by D(G): 

D(G) :={£> a DAG | D C G,D U = G U ,D[T] oriented according to some 

perfect elimination ordering for each chain component T € T(G)}. 



Here, D u denotes the skeleton of D (Appendix IA.1|) . We can now state the main result of this 
section, a graph theoretic characterization of X- essential graphs. For th e observational case X = 
{0}, this theorem corresponds to Theorem 4.1 of Andersson et al. ( 19971 ). 



Theorem 18 A graph G is the I-essential graph of a DAG D if and only if 
(i) G is a chain graph; 

(ii) for each chain component T G T(G), G[T] is chordal; 
(Hi) G has no induced subgraph of the form a — > b — c; 

(iv) G has no line a — b for which there exists some I £ X such that \I D {a, b}\ 
(v) every arrow a — >b € G is strongly I-protected. 



The graph G of Figure [3] satisfies points (i) to (iii) of Theorem [TSl For X = {0,{4}}, it also 



fulfills points (iv) and (v) in this case, it is the X-essential graph £x{D) of the DAG D of Figure 



1(a) by Proposition [TBI 



3.3 Construction of Interventional Essential Graphs 

In this section, we show that there is a simple way to construct the X-essential graph £i(D) of 
a DAG D: we need to successively convert arrows that are not strongly X-protected into lines 
(Algorithm [TJ . By doing this, we get a sequence of partial X-essential graphs. 

Definition 19 (Partial X-essential graph) Let D be a DAG. A graph G with D C G C £x{D) 
is called a partial X-essential graph of D if a — > b — c does not occur as an induced subgraph 
of G. 

The following lemma can be understood as a motivation for looking at such graphs. Note that 
due to the condition G C £x(D), and because G and £x(D) have the same skeleton, every arrow 



of £x(D) is also present in G, hence statement (ii) below makes sense. 

Lemma 20 Let D be a DAG. Then: 

(i) D and £x(D) are partial 1- essential graphs ofD. 

(ii) Let G be a partial X-essential graph of D. Every arrow a — >6 € £x(D) is strongly I-protected 
in G. 

(iii) Let G be a partial I-essential graph of two DAGs D\ and Z?2- Then, D\ ~j X>2- 

Algorithm [TJ constructs the X-essential graph G from a partial X-essential graph of any DAG 
D € D(G). The algorithm is indeed valid and calculates £x(D), since the graph produced in each 
iteration is a partial X-essential graph of D (Lemma l2Tj) , and the only partial X-essential graph 
that has only strongly X-protected arrows is £x(D) (Lemma [ 



9 



Lemma 21 Let D be a DAG and G a partial I -essential graph of D. Assume that a — > b £ G is 
not strongly X-protected in G, and let G' := G + (b,a) (that is, the graph we get by replacing the 
arrow a — >b by a line a — b; see Appendix \A.l\) . Then G' is also a partial X- essential graph of D. 

Lemma 22 Let D be a DAG. There is exactly one partial X-essential graph of D in which every 
arrow is strongly X-protected, namely £%{D). 

To construct £i(D) from some DAG D = (\p],E), we must, in the worst case, execute the 
iteration of Algorithm [T] for every arrow in the DAG; at each step, we must check every 4-tuple 
of vertices to see whether some arrow occurs in configuration (d) of Definition [TH Therefore 
Algorithm Q] has at most complexity Od-E) • p 4 ); by exploiting the partial order on T(G) (see 
Appendix IA.1|) . more efficient implementations are possible. Note that some checks only need to 
be done once. If, for example, an edge a — >5 is part of a v-structure (configuration (b) of Definition 
[T4|) . or if there is some l£l such that \L n {a, b}\ = 1 in the first iteration of Algorithm [H this 
will also be the case in every later iteration. 

3.4 Example: Identifiability under Interventions 

A simple example illustrates how much identifiability can be gained with a single intervention. We 
consider a linear chain as observational essential graph: 

G = £ m (D) :l — 2 — 3 p. 

We can easily count the number of representatives of G using the following lemma. 

Lemma 23 (Source lemma) Let G be a connected, chordal, undirected graph, and let D C G be 
a DAG without v-structures and with D u = G. Then D has exactly one source. 

Proof Let a be a topological ordering of D; then, u(l) is a source, see Appendix IA.11 It re- 
mains to show that there is at most one such source. Assume, for the sake of contradiction, 
that there are two different sources u and v. Since G is connected, there is a shortest u-v-path 
7 = (ao = u, a%, . . . , a,k = v). Let < — ai+i € D be the first arrow that points away from v in the 
chain 7 in D (note i > 1 since u — * a\ £ D by assumption). The v-structure aj_i — > a{ < — aj+i is 
not allowed as an induced subgraph of D, hence aj_i and a^+i must be adjacent in D and in G; 
however, 7 is then no shortest u-v-p&th, a contradiction. ■ 

For our linear chain G and any s £ \p], there is exactly one DAG D £ D(G) that has the 
(unique) source s, namely the DAG we get by orienting all edges of G away from s; other edge 
orientations would produce a v-structure. We conclude G has p representatives. 

Assume that the true causal model producing the data is (D,f), and denote the source of D 
by s £ [p\. Consider the conservative family of targets X = {0, {v }} with v £ [p]. If v < s, the 
interventional essential graph £i(D) is 

1 <— 2 <— . u + 1 — ... — p , 

and \T)(£x(D))\ = p—v by the same arguments as above; analogously, if v > s, we find \D(£z(D))\ = 
v — 1. On the other hand, if v = s, all edges of D are strongly Z-protected: those incident to s 



Input : G: partial I-essential graph of some DAG D (not known) 
Output: £x(D) 

while 3 a — > b G G s.t. a — > b not strongly I -protected in G do 
L G^G+(b,a); 

return G; 

Algorithm 1: REPLACEUNPROTECTED(X, G). Iterative construction of an Z-essential graph 
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because of the intervention target, all others because they are in configuration (a) of Definition 1141 
therefore, we have £%{D) = D. 

In the best case, all edge orientations in the chain can be identified by a single intervention, 
while the observational essential graph £^(D) that is identifiable from observational data alone 
contains p representatives. However, this needs an intervention at the a priori unknown source 
s. Choosing the central vertex [~|] as intervention target ensures that at least half of the edges 
become directed in £x(D), independent of the position s of the source. 



4 Greedy Interventional Equivalence Search 



Different algorithms have been proposed to estima te essential graphs from o bserva tional data. One 
of them, the Greedy Equivalence Search (GES) (|Meekl . 119971 : IChickerinej . l2002bh . is particularly 
interesting because of two properties: 

• It is score-based; it greedily maximizes some score function for given data over essential 
graphs. I t uses no tuning-pa rameter; the score function alone measures the quality of the 
estimate. IChickering (|2002bh chose the BIC score because of consistency; technically, any 
score equivalent and decomposable function (see Definition |24"|) is adequate. 



• It traverses the space of essential graphs which is the natural search space for model inference 
(see Section [3]) . We will see in Section O that a greedy search over equivalence classes yields 
much better estimation results than a naive greedy search over DAGs. 



GES greedily optimizes the score function in two phases ( Chickering . 2002bl ): 



• In the forward phase, the algorithm starts with the empty essential graph, Go := ([p],0). 
It then sequentially steps from one essential graph Gi to a larger one, Gi+i, for which there 
are representatives Di G D(Gj) and -Dj+i G D(Gj+i) such that Di + \ has exactly one arrow 
more than Di. 

• In the backward phase, the sequence (Gj), is continued by gradually stepping from one 
essential graph Gi to a smaller one, Gj+i, for which there are representatives Di G D(Gj) 
and Di + \ G D(Gj+i) such that D%+\ has exactly one arrow less than Di. 

In both phases, the respective candidate with maximal score is chosen, or the phase is aborted if 
no candidate scores higher than the current essential graph Gi. 

We introduce in addition a new turning phase which proved to enhance estimation (see Section 
E]). Here, the sequence (Gj)j is elongated by gradually stepping from one essential graph Gi to 
a new one with the same number of edges, denoted by Gj+i, for which there are representatives 
Di G D(Gj) and G D(Gj + i) such that Di + \ can be constructed from Di by turning exactly 
one arrow. As before, we choose the highest scoring candi date. Such a turni ng phase had already 
been proposed, but not characterized or implemented, by IChickering (|2002bh . 

Because GES is an optimization algorithm working on the space of observational essential 
graphs, and because the characterization of interventional essential graphs is similar to that of 
observational ones (Theorem I18j) . GES can indeed be generalized to handle interventional data 
as well by operating on interventional instead of observational essential graphs. We call this 
generalized algorithm Greedy Interventional Equivalence Search or GIES. An overview is shown in 
Algorithm [2j the forward, backward and turning phase are repeatedly executed in this order until 
none of them can augment the score function any more. 

A naive search strategy would perhaps traverse the space of DAGs instead of essential graphs, 
greedily adding, removing or turning single arrows from DAGs. It is well-known in the observational 
case that such an approach performs marked ly worse than one accounting for Markov equivalence 
(|Chickeringl . l2002bl : ICastelo and Kockl l2003h , and we will see in our simulations (Section [52]) that 
the same is true in the interventional case as long as few interventions are made. Ignoring Markov 
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equivalence cuts down the search space of successors at haphazard; since all DAGs in a Markov 
equivalence class represent the same statistical model, there is no justification for considering 
neighbors (that is, DAGs that can be reached by adding, removing or turning an arrow) of one of 
the representatives but not of the other ones. 

GIES can be used with general score functions. It goes without saying that the chosen score 
function should be a "reasonable" one which has favorable statistical properties such as consis- 
tency. We denote the score of a DAG D given interventional data (T, X) by S(D;T,'X.), and we 
assume that S is score equivalent, that is, it assigns the same score to Z-equivalent DAGs; I 
always stands for a conservative family of targets in this section. Furthermore, we require S to be 
decomposable. 



Definition 24 A score function S is called decomposable if for each DAG D, S can be written 
as a sum 

p 

S(D; T, X) = s (i, Pao(i); T, X), 
i=i 

where the local score s depends on X only via X., and X, pa with X.j denoting the i th 
column of X and X. pa @) the submatrix of X corresponding to the columns with index in pa D (i). 

Throughout the rest of this section, S always denotes a score equivalent and decomposable 
score function. Such a score function needs only be evaluated at one single representative of some 
interventional Markov equivalence class. Indeed, a key ingredient for the efficiency of the obser- 
vational GES as well as our interventional GIES is an implementation that computes the greedy 
steps to t he next equivalence class in a local fashion without enumerating all corresponding DAG 
members. Chickering ( 2002bl ) found a clever way to do that in the forward and backward phase 



of the observational GES. In Sections 14.11 and 14. 2\ we generalize his methods to the interventional 
case, and in Section 14. 3\ we propose an efficient implementation of the new turning phase. 



Input : (T, X): interventional data for family of targets X 
Output: I-essential graph 

g<-([p]>0); 

repeat 

DoContinue <- FALSE; 
repeat 

Gold G; 

G 4- ForwardStep(G; T, X) ; // See Algorithm^ 

until G id = G; 
repeat 

Gold G; 

G <- BackwardStep(G; T, X) ; // See Algorithm^ 

if Gold + G then DoContinue <- TRUE; 
until Gold = G; 
repeat 

Gold ^~ G; 

G <- TurningStep(G; T, X) ; // See Algorithm^ 

if Gold + G then DoContinue <- TRUE; 
until Gold = G; 
until -iDoContinue; 

Algorithm 2: GIES(T, X). Greedy Interventional Equivalence Search. The steps of the 
different phases of the algorithms are described in Algorithms [3]-[5j 
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4.1 Forward Phase 



A step in the forward phase of GIES can be formalized as follows: for an X-essential graph Gi, 
find the next one Gj+i := £x{Di + \), where 

Di + \ := argmax S(D ;7,X), and 

D + (Gj) := {D 1 a DAG | 3 an arrow u^v G D' : D' - (u,v) G D(Gj)} . 

If no candidate DAG D' G D + (Gj) scores higher than Gi, abort the forward phase. 

We denote the set of candidate 1-essential graphs by £ J(Gj) := {£x(D') \ D' G D + (Gj)}. 
In the next proposition, we show that each graph G' G £^(Gi) can be characterized by a triple 
(u, v, C), where u — >v is the arrow that has to be added to a representative D of Gi in order to get 
a representative D' of G", and C specifies the edge orientations of D within the chain component 
of v in G. 

Proposition 25 Let G be an I-essential graph, let u and v be two non-adjacent vertices of G, 
and let C C nec('u)- Then there is a DAG D G D(G) with {a G neciv) \ a — > v G D} = C such 
that D' := D + (u, v) G D + (G) i/ and only if 

(i) C is a clique in G[Tq(v)]; 

(ii) N := ne G (v) D &d G (u) C C; 

(Hi) and every path from v to u in G has a vertex in C . 
For given G, u, v and C determine D' uniquely up to X- equivalence. 



Note that points (i) and (ii) imply in particular that is a clique in G\Tc;{v)\. Proposition 1251 has 
already been proven for the case of observational data ( Chickerine . 2002bl . Theorem 15); it is not 



obvious, however, to see that this characterization of a forward step is also valid for interventional 
essential graphs, so we give a new proof in Appendix IB.3I using the results developed in Sections 
Eland El 

The DAGs D and D' in Proposition [25] only differ in the edge (u, v); v is the only vertex whose 
parents are different in D and D' . Since the score function S is assumed to be decomposable, the 
score difference between D and D' can be expressed by the local score change at vertex v, as stated 
in the following corollary. 



Input : G = ( [p] , E) : I-essential graph; (T, X) 
Output: G' S £j(G), or G 



interventional data for X 



AS n 



0: 



2 foreach v G [p] do 

foreach u G [p] \ a.dc(v) do 
N <— nec(v) n adc(u); 
foreach clique C C nec(v) with N C C do 
if /3 path from v to u in G[[p] \ C] then 



AS <- s(v, pa G (w) U C U {u}; T, X) - s(v, pa G (v) U C; T, X) 



/ / Proposition \2ffli)\ and (ii) 
// Provosition \2^fiii) 



if AS > AS max then 



AS max <- AS; 



) (u,w,C7); 



if AS max > then 

a LExBFS((C max 

Orient edges of G[Tc(v maK )] according to er; 
Insert edge (u max ,v max ) into G; 
return ReplaceUnprotected(I, G) ; 

else return G; 



// See Algorithm]^ 



Algorithm 3: ForwardStep(G; T, X). One step of the forward phase of GIES. 
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5 >6 7 5 >6 7 5 >6 7 

(b) 

(a) D (b) D' (c) S X (D') 



Figure 4: DAGs D, D' and Sx(D') illustrating a possible forward step of GIES for the fam- 
ily of targets X = {0, {4}}, applied to the X-essential graph G of Figure [3] for the parameters 
(u,v,C) = (4, 2, {3}) (notation according to Proposition [25]) . In parentheses in Figure (c): arrow 
configurations according to Definition [T41 arrows incident to 4 are strongly X-protected by the 
intervention target {4}. 



Corollary 26 Let G, u, v, C, D and D' be as in Proposition \ 25[ The score difference AS :- 
S(D'; T, X) — S(D; T, X) can be calculated as follows: 

AS = s(v, pa G (v) U C U {u}; T, X) - s(v , pa G (v) U C; T, X). 

In the observational case, this corollary corresponds to Corollary 16 of IChickerin 1 (l2002bh . 



The most straightforward way to construct an X-essential graph G' £ £^{G) characterized by 
the triple (u,v,C) as defined in Proposition [25] would be to create a representative D € D(G) by 
orienting the edges of Tq{v) as indicated by the set C, add the arrow u — >v to get D', and finally 
construct Sz(D') with Algorithm [1] The next lemma suggests a novel shortcut to this procedure: 
it is sufficient to orient the edges of the chain component Tq{v) only to get a partial X-essential 
graph of D' after adding the arrow u — > v . 

Lemma 27 Let G, u, v, C , D and D' be as in Proposition \2b\ Let H be the graph that we get 
by orienting all edges ofTc(v) as in D (leaving other chain components unchanged) and inserting 
the arrow (u,v). Then H is a partial I-essential graph of D' . 

Algorithm [3] shows our implementation of the forward phase of GIES, summarizing the results 
of Proposition [25] Corollary [261 and Lemma [27] Figure H] illustrates one forward step, applied to 
the X-essential graph G (for X = {0, {4}}) of Figure [3] and characterized by the triple (u, v, C) = 
(4, 2, {3}). Note that this triple is indeed valid in the sense of Proposition [25] {3} is clearly a 



clique (point (i) ), nee (2) n adc(4) = {3} (point (ii) ), and there is no path from 2 to 4 in G[[p] \ C] 
(point (iii) ). 



4.2 Backward Phase 

In analogy to the forward phase, one step of the backward phase can be formalized as follows: for 
an X-essential graph Gi, find its successor G«+i := £x{Di + i), where 

Di + i := argmax 5(D';X), and 
D'eD-(Gi) 

D-(Gi) := {D' a DAG \ 3 D e T>(Gi),u^v € D : D' = D - (u,v)} . 

If no candidate DAG D' € D + (Gj) scores higher than Gi, the backward phase is aborted. 

Whenever we have some representative D G D(G) of an X-essential graph G, we get a DAG 
in D~(G) by removing any arrow of D. This is in contrast to the forward phase where we do 
not necessarily get a DAG in D + (G) by adding an arbitrary arrow to D. By adding arrows, new 
directed cycles could be created, something which is not possible by removing arrows. This is the 
reason why the backward phase is generally simpler to i mplement than the forward phase. 

In Proposition [28] (corresponding to Theorem 17 of Chickering ( 2002bl ) for the observational 



case), we show that we can, similarly to the forward phase, characterize an X-essential graph of 
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fj(G) := {£x(D') | D' G D~(G)} by a triple (u,v,C), where C is a clique in nec(v). As in the 
forward phase, we see that the score difference of D and D' is determined by the local score change 
at the vertex v (Corollary [29]), and that lines in chain components other than Tq{v) remain lines in 
G' = Sx(D') (Lemma I3U|) . Algorithm [J] summarizes the results of the propositions in this section. 



Proposition 28 Let G = (\p],E) be an I-essential graph with (u,v) G E (that is, u — v G G 
or u — > v € G), and let C C nec(u). There is a DAG D G D(G) with u — > v G D and {a G 
neo(v) \ {u} | a — > v G D} = C such that D' := D — (u, v) G D~(G) if and only if 

(i) C is a clique in G[Tq(v)]; 

(ii) C C N := ne G (v) n ada(u). 

Moreover, u, v and C determine D' uniquely up to X-equivalence for a given G. 



Corollary 29 Let G, u, v, C, D and D' be as in Proposition The score difference AS 
S(D';T,X) - S(D;T,X) is: 

AS = s{v, {p& G (v) U C) \ {u}; T, X) - s{v, pa G (w) UCU {u}; T, X). 



In the observational case, this corresponds to Corollary 18 in IChickerind (|2002bl ). The analogue 
to Lemma [27] for a computational shortcut in the forward phase reads as follows: 



Lemma 30 Let G, u, v, C, D and D' be as in Proposition\28l Let H be the graph that we get by 
orienting all edges ofTc(v) as in D and removing the arrow (u,v). Then H is a partial I-essential 
graph of D' . 

A backward step of GIES is summarized in Algorithm U] and illustrated in Figure [5] The triple 
(u, v, C) = (2, 5, 0) used there to characterize the backward step obviously fulfills the requirements 
of Proposition [28l 



Input : G — ( [p] , E) : I-essential graph; (T, X) : interventional data for I 
Output: G' € £z(G), or G 
A5 max <- 0; 
foreach v G [p] do 

foreach u e nea(v) U p& G (v) do 
N <— nec(v) n adc(u); 
foreach clique C C N do 

AS 4- s(v, (p& G (v) U C) \ {u}- T, X) - s(v, pa G (w) U C U {u}; T, X); 
if AS" > AS^ then 
AS milx <- AS; 

C max ) <r- (u,v,C); 

if AS^ > then 

if "max G ne G (u max ) then a <- LExBFS((C max , u max , u ma x, ■ ■ ■), E[T G (v max )]); 

else (T LEXBFS((C ma x,Wmax,---) ; ^[ T G(Wmax)]); 

Orient edges of G[TG(w max )] according to er; 
Remove edge (u max , fmax) from G; 

return ReplaceUnprotected(X, G) ; // See Algorithm^ 

else return G; 

Algorithm 4: BackwardStep(G; T, X). One step of the backward phase of GIES. 
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1< 2 >3 >4 1< 2 >3 >4 1< 2- -3 >4 




5 >6 7 5 >6 7 5 >6 7 

(b) 

(a) O (b) D' (c) &(£>') 



Figure 5: DAGs D, D' and Sx(D') illustrating a possible backward step of GIES for the family of 
targets X = {0, {4}}, applied to the X-essential graph G of Figure [3] for the parameters (u, v, C) = 
(2,5,0) (notation according to Proposition [28]). Figure (c), in parentheses: arrow configurations 
according to Definition 1141 



4.3 Turning Phase 

Finally, we characterize a step of the turning phase of GIES, in which we want to find the successor 
Gj+i := £x{Di+i) for an Z-essential graph Gj by the rule 

argmax S(D';T,X.), where 

{D' a DAG \D' £ D(G;), and 3 an arrow u—>v£D': 
D' — (u,v) + (v,u) G D(Gj)} . 

When the score cannot be augmented anymore, the turning phase is aborted. The additional 
condition 11 D' £ D(Gj)" is not necessary in the definitions of D + (Gj) and D~(Gj); when adding 
or removing an arrow from a DAG, the skeleton changes, hence the new DAG is certainly not 
Z-equivalent to the previous one. However, when turning an arrow, the skeleton remains the same, 
and the danger of staying in the same equivalence class exists. 

Again, we are looking for an efficient method to find a representative D' for each G' G £j(Gj) := 
{Sx(D') | D' G D^Gj)}. It makes sense to distinguish whether the arrow that should be turned 
in a representative D G D(Gj) is X-essential or not. We start with the case where we want to turn 
an arrow which is not X-essential. 

Proposition 31 Let G be an X-essential graph with u — v G G, and let C C nec(v) \ {u}. Define 
N := nec(v) nadc(^). Then there is a DAG D G D(G) with u< — v G D and {a G nec(v) \ a — >v G 
D} = C such that D' := D - (v, u) + (u, v) G D°(G) if and only if 

(i) C is a clique in G\Tq{v)\; 

(ii) C\N^%; 

(Hi) C D N separates C\N and N\C in G[ne G (v)\. 
For a given G, u, v and C determine D' up to X- equivalence. 

There are now two vertices that have different parents in the DAGs D and D', namely u and v; 
thus the calculation of the score difference between D and D' involves two local scores instead of 
one. 

Corollary 32 Let G, u, v, C , D and D' be as in Proposition \31[ Then the score difference 
AS := S(D';T, X) — S(D;T,X.) can be calculated as follows: 

AS = s(v, pa G (v) U G U {u}; T, X) + s(u, pa G (u) U (G n N); T, X) 

- s(v, p& G (v) U G; T, X) - s(u, pa G (u) U (G n N) U {v}; T, X). 

Lemma 33 Let G, u, v, C , D and D' be as in Proposition \31[ Let H be the graph that we get by 
orienting all edges ofTc(v) as in D and turning the arrow (v,u). Then H is a partial X-essential 
graph of D' . 



Di+i := 
D°(G,) := 
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(a) D 



(b) D' 



5 ► 6 

(b) 

(c) £ X (D') 



Figure 6: DAGs D, D' and £z(D') illustrating a possible turning step of GIES applied to the 
X-essential graph G (X = {0,{4}}) of Figure [3] for the parameters (u,v,C) = (5, 2, {3}) (notation 
of Proposition I31[) . The arrow 2 — > 5 is not X-essential in D. Figure (c): arrow configurations in 
parentheses, see Definition 1141 



1 <- 



(a) G 



2 <- 



(b) D 



(b) (b) 
1 > 2 < 3 



(b) 



-> 5 



(b) 

(c) D' = Sx(D') 



Figure 7: Graphs G, D, D' and Sz(D') illustrating a possible turning step of GIES for the family 
of targets X = {0, {4}} and the parameters (u, v, C) = (1, 2, {3}) (notation of Proposition [34"|). The 
arrow 2 — >1 is X-essential in D. Figure (c): arrow configurations in parentheses, see Definition 1141 



A possible turning step is illustrated in Figure El where a non-X-essential arrow (for X = 
{0, {4}}) of a representative of the graph G of Figure [3] is turned. The step is characterized by the 
triple (u, v, C) = (5, 2, {3}) which satisfies the conditions of Proposition[3TJ {3} is obviously a clique 
(point [(I)]), C\N = C since N = {1} (point pi, and C\N = {3} and N\C = {1} are separated 



in G[nec(2)] (point (iii) ). In contrast, the triple (u,v,C) = (5, 2, {1}) fulfills points (i) and (iii) of 
Proposition Ell but not point [(ji)J There is a DAG D G D(G) with {a G ne G (2) | a—>2 € D} = {1}, 
and turning the arrow 2 — > 5 in D yields another DAG D' (that is, does not create a new cycle). 
This new DAG D', however, is X-equivalent to D, and hence not a member of D ( - ) (G) (see the 
discussion above). 

We now proceed to the case where an X-essential arrow of a representative of G is turned; here 
there is no danger to remain in the same Markov equivalence class. The characterization of this 
case is similar to the forward phase. 



Proposition 34 Let G be anl-essential graph with u< — v £ G, and let C C nea(v). Then there is 
a DAG D eD(G) with {a G ne G {v) j a — >v e D} = C such that D' := D — (v, u) + (n, v) G D°(G) 
if and only if 

(i) C is a clique; 

(ii) N := ne G (v) D &d G (u) C C; 

(iii) every path from v to u in G except (v,u) has a vertex in C U nec(u). 
Moreover, u, v and C determine D' up to I -equivalence. 



Chickerind (|2002ah has already proposed a turning step for essential arrows in the observational 
case; however, he did not provide necessary and sufficient conditions specifying all possible turning 
steps as Proposition EH does. 



Lemma 35 Let G, u, v, C , D and D' be as in Proposition \34\ and let H be the graph that we 
get by orienting all edges of ' Tq{v) and Tq{u) as in D and by turning the edge (v,u). Then H is 
a partial I-essential graph of D' . 
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To construct a G' G £j(G) out of G, we must possibly orient two chain components of G 
instead of one (Lemma l35j) . In the example of Figure \7\ we see that it is indeed not sufficient to 
orient the edges of Tq(v) alone in order to get a partial X-essential graph of G' . The arrow 1 — > 5 
is not X-essential in D, hence 5 £ Tq(1). However, the same arrow is X-essential in D' and hence 
also present in £z(D'). 

Despite the fact that we need to orient the edges of Tq(v) and Tq(u) to get a partial X- 
essential graph of D', Sx(D') is nevertheless determined by the orientation of edges adjacent to v 
(determined by the clique C) alone. This comes from the fact that in D, defined as in Proposition 
[Ml all arrows of D[Tq{u)\ must point away from u. 

Corollary 36 Let G, u, v, C , D and D' be as in Proposition \34\ Then the score difference 
AS := S(D'; T, X) — S(D; T, X) can be calculated as follows: 

AS = s(v, p& G (v) U C U {u}; T, X) + s(u, p& G (u) \ {v }; T, X) 
-s(v, p& G {v) U G; T, X) - s(u, pa G (n); T, X). 

The entire turning step, for essential and non-essential arrows, is shown in Algorithm [5j 



Input : G = ( [p] , E) : Z-essential graph; (T, X) : interventional data for X 
Output: G S S%, or G 

ASmax <- 0; 

foreach v G [p] do 

foreach u G ne G (u) do // Consider arrows that are not X- essential for turning 

N <— ne G (w) H ad G (u); 

foreach clique C C ne G (v) \ {u} do // Provosition \3\]/i)\ 

if C \ N ^ and {u, v} separates C and N\C in G[T G (v)} then 



// Provosition \3jftnj\ and (in) 
AS <- s(v, pa G (v) U C U {w}; T, X) + s(u, pa G (u) U (G n AT); T, X); 
AS <— AS" - s(«, pa G (u) U C; T, X) - s(u, pa G (u) U (C n N) U {«}; T, X); 
if AS > AS roax then 
AS max <- AS; 

C m ax) <~ (U,V,C); 



foreach u G ch G (v) do // Consider X-essential arrows for turning 

N <— ne G (v) n ad G (u); 

foreach clique C C ne G (w) with N C C do // Proposition \3$ij\ and (ii) 

if /3 pai/i /rom v to u in G[[p] \ (C U ne G (it))] — (u, u) then // Proposition \3$m) 

AS ^- s(u, pa G (w) U G U {w}; T, X) + s(u, pa G (u) \ {«}; T, X); 
AS <- AS - s(v, pa G (w) U C; T, X) - s(u, pa G (u); T, X); 



AS, 



then 

l'l; UC ^ AS] 



i^max) ^max; C m ax) <~ (U,V,C); 



if AS'max > then 



if v 



max 



G G then 



LEXBFS((u max , • ■ .),£[T G (u mH[ )]); 
Orient edges of G[T G (u m ax)] according to er; 
a v := LExBFS((C ma x ■),E[Tc(v max )]y, 
else := LExBFS((C max ,iJ max ,)i max ,...),£[T G (D„ 
Orient edges of G[T G (u ma x)] according to cr v ; 
Turn edge (v ma x,Umax) in G; 
return ReplaceUnprotected(2, G) ; 

else return G: 



<)]); 



// See Algorithm]^ 



Algorithm 5: TurningStep(G; T, X). One step of the turning phase of GIES. 
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4.4 Discussion 



Every step in the forward, backward and turning phase of GIES is characterized by a triple (u, v, C), 
where u and v are different vertices and C is a clique in the neighborhood of v. To identify the 
highest scoring movement from one X-essential graph G to a potential successor in £^(G), £%(G) 
or £j(G), respectively, one potentially has to examine all cliques in the neighborhood nec(v) of 
all vertices v 6 [p]. The time complexity of any (forward, backward or turning) step applied to an 
X-essential graph G hence highly depends on the size of the largest clique in the chain components 
of G. By restricting GIES to X-essential graphs with a bounded vertex degree, the time complexity 
of a step of GIES is polynomial in p; otherwise, it is in the worst case exponential. We believe, 
however, that GIES is in practice much more efficient than this worst-case complexity suggests. 
Some evidence for this claim is provided by the runtime analysis of our simulation study, see 
Section 1^21 

A heuristic approach to guarantee polynomial runtime of a greedy search has been proposed 
bv lCastelo and Kockal (|2003h for the observational case. Their Hill Climber Monte Carlo (HCMC) 
algorithm operates in DAG space, but to account for Markov equivalence, the neighborhood of a 
number of randomly chosen DAGs equivalent to the current one is scanned in each greedy step. 
The equivalence class of the current DAG is explored by randomly turning "covered arrows" , that 
is, arrows whose reversal does not change the Markov property. In our (interventional) notation, an 
arrow is covered if and only if it is not strongly X-protected (Definition ll4|) . By limiting the number 
of covered arrow reversals, a polynomial runtime is guaranteed at the cost of potentially lowering 
the probability of investigating a particular successor in £^(G), £~^{G) or £j-(G), respectively. 
HCMC hence enables a fine tuning of the trade-off between exploration of the search space and 
runtime, or between greediness and randomness. 

The order of executing the backward and the turning phase seems somewhat arbitrary. In the 
analysis of the steps performed by GIES in our simulation study (Section 15. 2p . we saw that the 
turning phase can generally only augment the score when very few backward steps were executed 
before. For this reason, we believe that changing the order of the backward and the turning phase 
would have little effect on t he overall performa nce of GIES. 

As already discussed bv IChickerind (|2002bl ) for the observational case, caching techniques can 
markedly speed up GES; the same holds for GIES. The basic idea is the following: in a forward step, 
the algorithm evaluates a lot of triples (u,v, C) to choose the best one, (u ma x 5 f ma x, C max ) (lines 
[2] to [10] in Algorithm [3]) . After performing the forward move corresponding to (u max , v max , C max ), 
many of the triples evaluated in the step before are still valid candidates for next step in the sense 
of Proposition I25I and lead to the same score difference as before (see Corollary [26]) . Caching those 
values avoids unnecessary reevaluation of possible forward steps. The same holds for the backward 
and the turning phase; since the forward step is most frequently executed, a caching strategy in 
this phase yields the highest speed-up though. 

We emphasize that the characterization of "neighboring" X-essential graphs in £^(G), £j(G) or 
£^(G), respectively, by triples (u, v, C) is of more general interest for structure learning algorithms, 
for example for the design of sampling steps of an MCMC algorithm. Also the beforementioned 
HCMC algorithm could be extended to interventional data by generalizing the notion of "covered 
arcs" using Definition I14l 

The prime example of a score equiva lent and decomposable score function is the Bayesian infor- 
mation criterion (BIC) (jSchwarzl . Il978l ) which we used in our simulations (Section [5]). It penalizes 
the complexity of causal models by their number of free parameters (£q penalization); this number 
is the sum of free parameters of the conditional densities in the Markov factorization (Definition 
[T]), which explains the decomposability of the score. Using different penalties, for example, £2 
penalization, can lead to a non-decomposable score function. GIES can also be adapted to such 
score functions; the calculation of score differences becomes computationally more expensive in 
this case since it cannot be done in a local fashion as in Corollaries [26] l29l [321 and [36] 

GIES only relies on the notion of interventional Markov equivalence, and on a score function 
that can be evaluated for a given class of causal models. As we mentioned in Section 12.11 we 
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believe that interventional Markov equivalence classes remain unchanged for models that do not 
have a strictly positive density. For this reason it should be safe to also apply GIES to such a 
model class. 



5 Experimental Evaluation 

We evaluated the GIES algorithm on simulated interventiona l data (Section 15.21) a nd on in silico 
gene expression data sets taken from the DREAM4 challenge ( Marbach et al. . 20 id ) (Section I5.3p . 
In both cases, we restricted our considerations to Gaussian causal models as summarized in Section 



5.1 Gaussian Causal Models 

Consider a causal model (D, /) with a Gaussian density of the form A/*(0, E). The observational 
Markov property of such a model translates to a set of linear structural equations 

p 

X % = PijXj m ~ P ' JV(0, 1 < i < p , (5) 

where /3jj = if j £ p& D (i). When the DAG structure D is known, the covariance matrix E can 
be parameterized by the weight matrix 

B := 0%)? ii=1 G B(D) := {A = (a y ) G R pxp \ a tJ = if j £ pa D «} 

that assigns a weight to each arrow j — > i G D, and the vector of error covariances a 2 := 
(of,...,o|): 

E = Cov(A) = (1 - diag(cr 2 )(l - £)" T . 

This is a consequence of Equation (jSJ) . 

We always assume Gaussian intervention variables J7j (see Section [2.ip . In this case, not only 
the observational density / is Gaussian, but also the interventional densities f(x \ do£)(A/ = Ui)). 
An interventional data set (T, X) as defined in Equation (|2|) then consists of n independent, but 
not identically distributed Gaussian samples. 

We use the Bayesian information criterion (BIC) as score function for GIES: 

S(D; T, X) := sup{£ D (B, a 2 ; T, X) | B G B(D), a 2 G R p >0 } - ^ log(n) , 
where £d denotes the log-likelihood of the density in Equation ©: 

n 

l D (B,a 2 ;T,X) := £log/(X« | do D (X® } = U T[i) )) (6) 
i=l 

n 

i=l j<£T(*) j6TW 
1 n 1 

1 = 1 j£T(i) 3 

= -\t[\^\3iT^}\lo g a 2 + ^ £ (X?~B,X 
j=i j i-jprH) 



2 - 



+ c 



where the constant C is independent of the parameters (B,a 2 ) of the model. Since Gaussian causal 
models with structure D are parameterized by B G B(D) and cr 2 G K>Q' we nave hp = p + \E\ free 
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parameters, where E denotes the edge set of D. It can be seen in Equation ([6]) that the maximum 
likelihood estimator (MLE) (B, a 2 ), the maximizer of ip, mini mizes the residual sum o f squa res 



for the different structural equations; for more details we refer to lHauser and Biihlmannl (|2012l ). 

The DAG D maximizing the BIC yields a consistent estimator for the true causal structure D 
in the sense that P[D ~j D] — > 1 in the limit n — > oo as long as the true density / is faithful 
with respect to D, that is, every condit i onal i ndependence relation of / is encoded in the Markov 
property of D ( Hauser and Buhlmann . 20121 ). Note that the BIC score is even defined in the 



high-dimensional setting p > n; however, we only consider low-dimensional settings here. 
5.2 Simulations 

We simulated interventional data from 4000 randomly generated Gaussian causal models as de- 
scribed in Section f5.2. 11 In Sections 15.2.21 and l5.2.3} we present our methods for evaluating GIES; 
the results are discussed in Section [5.2.41 As a rough summary, GIES markedly beat the concep 



tually simpler greedy search over the space of DAGs as well as the original GES of IChickering 



(|2002bh ignoring the interventional nature of the simulated data sets. Its learning perf 
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could keep up with a provably consistent exponential time dynamic programming algorithm at 
much lower computational cost. 

5.2.1 Generation of Gaussian Causal Models 

For some number p of vertices, we randomly generated Gaussian causal models parameterized by a 
structure D, a weight matrix B £ B(-D) and a vector of error covariances a 2 6 M> by a procedure 



slightly adapted from lKalisch and Biihlmannl (|20071 ): 



1. For a given sparseness parameter s € (0,1), draw a DAG D with topological ordering 
(1, . . . ,p) and binomially distributed vertex degrees with mean s(p — 1). 

2. Shuffle the vertex indices of D to get a random topological ordering. 

3. For each arrow j — >i E D, draw /5^ - ~ U([—l, —0.1] U [0.1, 1]) using independent realizations; 
for other pairs of set /3^- = (see Equation ([5])). This yields a weight matrix B' = 
(fi'ij)ij=l ^ -^H-^) w ith positive as well as negative entries which are bounded away from 0. 

4. Draw error variances a' 2 l ~ ' U([0.5, 1]). 

5. Calculate the corresponding covariance matrix £' = (1 - B'y 1 diag(cr /2 )(l - B')~ T . 

6. Set H := diag((S / 11 )- 1 / 2 , . . . , (S^)" 1 / 2 ), and normalize the weights and error variances as 
follows: 

B := HB'H-\ (a 2 , a 2 p f := H 2 (a' 2 , . . . , a' 2 f . 
It can easily be seen that the corresponding covariance matrix fulfills 

S = (1 - B)- 1 diag(a 2 )(l - B)~ T = HE'H , 
ensuring the desired normalization = 1 for all i. 



Steps Q] and [3] are provided by the function randomDAGO of the R-package pcalg (jKalisch et al 
20121 ). 



We considered families of targets of the form X = {0, Ji, . . . , !&}, where Ji, . . . , If. are k different, 
randomly chosen intervention targets of size m; the target size m had values between 1 and 4. For 
a fixed sample size n, we produced approximately the same number of data samples for each 
target in the family 1 by using a level density M((2, . . . , 2), (0.2) 2 l m ) in each case (see the model 
in Equation (HJ). With this choice and the aforementioned normalization of X, the mean values 
of the intervention levels lay 2 standard deviations above the mean values of the observational 
marginal distributions. In total, we considered 4000 causal models and simulated 128 observational 
or interventional data sets from each of them by combining the following simulation parameters: 
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• (p,s) € {(10, 0.2), (20, 0.1), (30, 0.1), (40, 0.1)} with 1000 DAGs each. 

• k = 0, 0-2p, OAp, ... ,p for each value of p; the first setting is purely observational. 

• me {1,2,4}. 

• ne {50, 100, 200, 500, 1000, 2000, 5000, 10000}. 

In addition, we generated causal models with p G {50, 100, 200} (100 DAGs each) and p = 
500 (20 DAGs) with an expected vertex degree of 4 (which corresponds to a sparseness pa- 
rameter of s = 4:/(p — 1)) and simulated 6 data sets for the parameters k = 0.4 and n € 
{1000, 2000, 5000, 10000, 20000, 50000} from each of these models. We only used these additional 
data sets for the investigation of the runtime of GIES. 



5.2.2 Alternative Structure Learning Algorithms 

We com pare GIES with th ree alternative greedy search algorithms. The first one is the original 
GES of iGhickerinel (|2002bh which regards the complete interventional data set as observational 



(that is, ignores the list T of an interventional data set (T, X) as defined in Equation ([2]))- The 
second one, which we call GIES-nt (for "no turning"), is a variant of GIES that stops after the 
first forward and backward phase and lacks the turning phase. The third algorithm, called GDS 
for "greedy DAG search", is a simple greedy algorithm optimizing the same score function as 
GIES, but working on the space of DAGs instead of the space of X-essential graphs; GDS simply 
adds, removes or turns arrows of DAGs in the forward, backward and turning phase, respectively. 
Furthermore, for p < 20 , we co mpare with a dynamic programming (DP) approach proposed by 
Silander and Mvllvmakil (j200fih . an algorithm that finds a global optimum of any decomposable 



score function on the space of DAGs. Because of the exponential growth in time and memory 
requirements, we could not calculate DP estimates for models with p > 30 variables. For GDS and 
DP, we examine the X-essential graph of the returned DAGs. 



5.2.3 Quality Measures for Estimated Essential Graphs 



The structural H amming distance or SHD (jTsamardinos et all 120061 : we use the slightly 



adapted version of iKalisch and Biihlmannl . 120071 ) is used to measure the distance between an 
estimated I-essential graph G and a true X-essential graph or DAG G. If A and A denote the 
adjacency matrices of G and G, respectively, the SHD between G and G reads 



SHD(G, G) r- 



l<i<j'<p 



1 - 1 



The SHD between G and G is the sum of the numbers of false positives of the skeleton, false 
negatives of the skeleton, and wrongly oriented edges. Those quantities are defined as follows. 
Two vertices which are adjacent in G but not in G count as one false positive, two vertices which 
are adjacent in G but not in G as one false negative. Two vertices which are adjacent in both 
G and G, but connected with different edge types (that is, by a directed edge in one graph, by 
an undirected one in the other; or by directed edges with different orientations in both graphs) 
constitute a wrongly oriented edge. 



5.2.4 Results and Discussion 

As we mentioned in Section 13.11 the undirected edges in the X-essential graph £x(D) of some 
causal structure D are the edges with unidentifiable orientation. The number of undirected edges 
in £x(D) analyzed in the next paragraph is therefore a good measure for the identifiability of D. 
Later on, we study the performance of GIES and compare it to the other algorithms mentioned in 
Section 15^21 
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Identifiability under Interventions 

In Figure the number of non-Z-essential arrows is plotted as a function of the number k of 
non-empty intervention targets (k = \X\ — 1, see Section f5 . 2 . 1 1) . With single- vertex interventions at 
80% of the vertices, the majority of the DAGs used in the simulation are completely identifiable; 
with target size m = 2 or m = 4, this is already the case for k = 0.6p or k = OAp, respectively. 
For the small target sizes used, the identifiability under k targets of size m is similar to the 
identifiability under k ■ m single- vertex targets. 

A certain prudence is advisable when interpreting Figure [8] since the number of orientable 
edges also reflects the characteristics of the generated DAGs. Nevertheless, the plots show that 
the identifiability of causal models increases quickly even with few intervention targets. In regard 
of applications this is an encouraging finding since it illustrates that even a small number of 
intervention experiments can strongly increase the identifiability of causal structures. 
Performance of GIES 

Figure [9] shows the structural Hamming distance between true DAG D and estimated X- 
essential graph G for different algorithms as a function of the number k of intervention targets. 
Single- vertex interventions are considered; for larger targets, the overall picture is comparable 
(data not shown). In 10 out of 12 cases for p < 20, the median SHD values of GIES and DP 
estimates are equal; in the remaining cases, too, GIES yields estimates of comparable quality — at 
much lower computational costs. 

In parallel with the identifiability, the estimates produced by the different algorithms improve 
for growing k. This illustrates that interventional data arising from different intervention targets 
carry more information about the underlying causal model than observational data of the same 
sample size. 

For complete interventions, that is, k = p, every DAG is completely identifiable and hence its 
own X-essential graph. Therefore, GDS and GIES are exactly the same algorithm in this case. 
With shrinking k, the performance of GDS compared to that of GIES gets worse. On the other 
hand, GES coincides with GIES in the observational case (k = 0). For growing k, the estimation 
performance of GES stays approximately constant; it can, as opposed to GIES, not make use 
of the additional information coming from interventions. To sum up, both the price of ignoring 
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Figure 8: Number of non-X-essential arrows as a function of the number k of intervention vertices. 
In parentheses: number of outliers in the corresponding boxplot. 
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■ Oracle 

■ GIES 

■ GDS 
□ GES 



Number of intervention targets (k) 



Figure 9: SHD between X-essential graph G estimated from n = 1000 data points and true DAG D 
as a function of the number k of single-vertex intervention targets. "Oracle estimates" denote the 
respective true X-essential graph £z(D), the best possible estimate under some family of targets X 
(see also Figure [8]). DP estimates are missing in the two lower plots. 



interventional Markov equivalence (GDS) and ignoring the interventional nature of the provided 
data sets (GES) are apparent in Figure [9j 

The performance of GIES as a function of the sample size n is plotted in Figure [10] for the 
DAGs with p = 20 vertices and intervention targets of size m = 4. The quality of the GIES 
estimates is comparable to that of the DP estimates. The behavior of the SHD values for growing 
n is a stron g hint for the consistency of G IES in the limit n — > oo (note that the DP algorithm is 
consistent; Hauser and Buhlmann . 2012). In contrast, the plots for k = and k = 4 again reveal 
the weak performance of GDS for small numbers of intervention vertices; the plots suggest that 
GDS, in contrast to GIES, does not yield a consistent estimator of the X-essential graph due to 
being stuck in a bad local optimum. 

The most striking result in Figure [TUl is certainly the fact that the estimation performance of 
GES heavily decreases with growing n as long as the data is not observational {k > 0). This is 
not an artifact of GES, but a problem of model-misspecification: running DP for an observational 
model (that is, considering all data as observational as GES does) yields SHD values maximally 
14% below that of GES (data not shown). For single- vertex interventions, the SHD values of the 
GES estimates stay approximately constant with growing n; for target size m = 2, its SHD values 
also increase, but not to the same extent as for m = 4. 

In Figure \TT\ we compare the SHD between true and estimated X-essential graphs with p = 30 
vertices for estimates produced by different greedy algorithms; other vertex numbers give a similar 
picture. In most settings, GIES beats both GDS and GIES-nt. It combines both the advantage 
of GIES-NT, using the space of interventional Markov equivalence classes as search space, and 
GDS, the turning phase apparently reducing the risk of getting stuck in local maxima of the score 
function. 

As noted in Section 15. 2, 3} the SHD between true and estimated interventional essential graphs 
can be written as the sum of false positives of the skeleton, false negatives of the skeleton and 
wrongly oriented edges. Those numbers are shown in Figure [T2l again for GIES estimates under 
single-vertex interventions for DAGs with p = 30 vertices. False positives of the skeleton are the 
main contribution to the SHD values. In 60% of the cases, especially for large n and small k, 
wrongly oriented edges represent the second-largest contribution. 
Runtime Analysis 
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Figure 10: SHD between estimated and true Z-essential graph for different numbers k of inter- 
vention targets of size m = 4 for the DAGs with p = 20 vertices. The abscissa denotes the total 
sample size n. For example, a data set with n = 1000 and k = 4 consists of 200 observational 
samples and 200 interventional samples each arising from interventions at four different targets, 
see Section [5.2. li 



All algorithms evaluated in this section were implemented in C++ and compiled into a library 
using the GNU compiler g++ 4.6.1. The simulations — that is, the generation of data and the 
library calls — were performed using R 2.13.1. All simulations were run on an AMD Opteron 8380 
CPU with 2.5 GHz and 2 GB RAM. 

Figure [13] shows the running times of GIES and DP as a function of the number p of vertices. 
GDS had running times of the same order of magnitude as GIES; they were actually up to 50% 
higher since we used a basic implementation of GDS compared to an optimized version of GIES 
(running times of GDS are not plotted for this reason). The linearity of the GIES values in 
the log-log plot (see the solid line in Figure [T3|) indicate a polynomial time complexity of the 
approximate order 0(p 2 ' 8 ), in contrast to the exponential complexity of DP; note that GIES 
also has an exponential worst case complexity (see Section l4.4p . The multiple linear regression 
log(i) = (3q + fii log(p) + @2 log(|i?|) + £, where t denotes the runtime and E the edge set of the 
true DAG, yields coefficients /3i = 1.01 and /3 2 = 0.94. 



5.3 DREAM4 Challenge 

We also measured the perfo rmance of GIES on synthetic gene ex pression data sets from the 
DREAM4 in silico challenge (jMarbach et all l20ld ; IPrill et all bold ) . Our goal here was to evalu- 
ate predictions of expression levels of gene knockout or knockdown experiments by cross-validation 
based on the provided interventional data. 



5.3.1 Data 

The DREAM4 challenge provides five data sets with an ensemble of interventional and observa- 
tional dat a simulated from five biologically plausible, possibly cyclic gene regulatory networks with 
10 genes (jMarbach et allboOflh . The data set of each network consists of 

• 11 observational measurements, simulated from random fluctuations of the system parame- 
ters (resembling observational data measured in different individuals); 

• 10 measurements from single-gene knockdowns, one knockdown per gene; 
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Figure 11: Mean SHD between estimated and true X-essential graph for different greedy algorithms 
as a function of n and k; data for DAGs with p = 30 and single-vertex interventions. Shading: algo- 
rithm yielded significantly better estimates than one ( ) or two (■) of its competitors, respectively 
(paired i-test on a significance level of a = 5%). 
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Figure 12: False positives (FP) and false negatives (FN) of the skeleton and wrongly oriented edges 
(WO; Section I5.2.3|) of the GIES estimates compared to the true X-essential graphs with p = 30 
vertices; mean values as a function of k and n for single-vertex interventions. Shading: ratio of 
each quantity and the SHD between estimated and true X-essential graph (dark means a large 
contribution to the SHD). 



• 10 measurements from single-gene knockouts, one knockout per gene; 

• five time series with 21 time points each, simulated from an unknown change of parameters 
in the first half (corresponding to measurements under a perturbed chemical environment 
having unknown effects on the gene regulatory network) and from the unperturbed system 
in the second half. 

Since our framework can not cope with uncertain interventions (that is, interventions with un- 
known target), we only used the 50 observational measurements of the second half of the time 
series. Altogether, we have, from each network, a total of 81 data points, 61 observational and 
20 interventional ones. We normalized the data such that the observational samples of each gene 
have mean and standard deviation 1. In this normalization, 95% of the intervention levels (that 
is, the expression levels of knocked out or knocked down genes) lie between —8.37 and —0.62 with 
a mean of —3.30 (Figure [bi|) . 



5.3.2 Methods 

We used each interventional measurement (20 per network) as one test data point and predicted 
its value from a network estimated with training data consisting either of the 80 r emaining data 



point s, or the 61 observational measurements alone. We used GIES, GES and PC (jSpirtes et al 



200(ih to estimate the causal models and evaluated the prediction accuracy by the mean squared 



error (MSE). We will use abbreviations like "GES(80)" or "PC(61)" to denote GES estimates 
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Figure 13: Runtime of GIES and DP as a function of the vertex number. 
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Figure 14: Standardized intervention levels in the different DREAM4 data sets. Data is scaled 
such that the observational samples have empirical mean and standard deviation 1. 



based on a training set of size 80 or PC estimates based on an observational training set of size 
61, respectively. 

For a given DAG, we predicted interventional gene expression levels based on the estimated 
structural equation model after replacing the structural equation of the intervened variable by 
a constant one; see Section 15.11 for connection between Gaussian causal models and structural 
equations, especially Equation ©. GES and PC regard all data as observational and yield an 
observational essential graph. For those algorithms, we enumerated all representati ve DAGs of the 



estim ated equivalence class using the function allDagsQ of the R package pcalg (jKalisch et al 



calculated an expression level with each of them, and took the mean of those predictions. 



GIES (80) yields a single DAG in each case since the 19 interventional measurements in the training 
data ensure complete identifiability. 

Furthermore, we used the evaluation script provided by the DREAM4 challenge to assess the 
quality of our network predictions to those sent in to the challenge by participating teams. This 
evaluation is based on the area under the ROC curve (AUROC) of the true and false positive rate 
of the edge predictions. 



5.3.3 Results 

Figure [T5l shows boxplots of MSE differences between GIES(80) and its competitors; that is, we 
consider quantities of the form 

AMSE comp := MSE comp — MSE GIES ( 80 ), (7) 

where comp stands for one of the competitors. Since the MSE differences showed a skewed distri- 
bution in general, we used a sign test for calculating their p-values. 

Except for one case (PC(61) in network 1), GIES(80) always yielded the best predictions of 
all competitors. Although all data sets are dominated by observational data (61 observational 
measurements versus 20 interventional ones), GIES can make use of the additional information 
carried by interventional data points to rule out its observational competitors. On the other hand, 
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Figure 15: Upper row: MSE values of GIES and competitors; lower row: differences of MSE values 
as denned in Equation 0; large values indicate a good performance of GIES. (A) GIES (80), (B) 
PC(80), (C) PC(61), (D) GES(80), (E) GES(61). Numbers below the boxplots: p-values of a 
one-sided sign test. 

the dominance of observational data is probably one of the reasons for the fact that GIES does 
not outperform the observational methods more clearly but has an overall performance which is 
comparable with that of its competitors. Another reason could be the fact that the underlying 
networks used for data generation are not acyclic as assumed by GIES. Interestingly, the winning 
margin of GIES in network 5 was not smaller than in other networks although the corresponding 
data set has the smallest intervention levels (in absolute values; see Figure [T4"|) . 

29 teams participated in the DREAM4 challenge. Their AUROC values are available from 
the DREAM4 website^] adding our values gives a data set of 30 evaluations. Among those, 
our results had overall rank 10, and ranks 8, 4, 21, 10 and 3, respectively, for networks 1 to 5. 
Except for network 3, we could keep up with the best third of the participating teams despite the 
beforementioned model misspecification given by the assumption of acyclicity, and despite the fact 
that we ignored the time series structure and half of the time series data. 



6 Conclusion 

We gave a definition and a graph theoretic criterion for the Markov equivalence of DAGs under 
multiple interventions. We characterized corresponding equivalence classes by their essential graph, 
defined as the union of all DAGs in an equivalence class in analogy to the observational case. 
Using those essential graphs as a basis for the algorithmic representation of interventional Markov 
equivalence classes, we presented a new greedy algorithm (including a new turning phase), GIES, 
for learning causal structures from data arising from multiple interventions. 

In a simulation study, we showed that the number of non-orientable edges in causal structures 
drops quickly even with a small number of interventions; our description of interventional essential 
graphs makes it possible to quantify the gain in identifiability. For a fixed sample size n, GIES 
estimates got closer to the true causal structure as the number of intervention vertices grew. For 
DAGs with p < 20 vertices, the GIES algorithm could keep up with a consistent, exponential- 
time DP approach maximizing the BIC score. It clearly beat GDS, a simple greedy search on 
the space of DAGs, as well as GES which cannot cope with interventional data. Our novel turn- 
ing phase prov ed to be an improv ement of GES even on observational data, as it was already 
conjecture d by Chickering ( 2002~bl ). Applying GIES to synthetic data sets from the DREAM4 



challenge (jMarbach et al.1 . 12010 ) . we got better predictions of gene expression levels of knockout 
or knockdown experiments than with observational estimation methods. 

The accurate structure learning perfor mance of GIES in the limit of large data sets raises 



the question whether GIES is consistent. Chickering ( 2002bl ) proved the consistency of GES on 



observational data. However, the generalization of his proof for GIES operating on interventional 
data is not obvious since such data are in general not identically distributed. 



1 DREAM4 can be found at http://wiki.c2b2.columbia.edu/dream/index.php/D4c2 
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A Graphs 



In this appendix, we shortly summarize our notation (mostly following IAndersson et ah . 1997) and 



basic facts concerning graphs. All statements about perfect elimination orderings that are used in 
Sections [3] and H] are listed or proven in Section lA~2l 

A.l Definitions and Notation 

A graph is a pair G = (V,E), where V is a finite set of vertices and E C E*(V) := (V x V) \ 
{(a,a)\a G V} is a set of edges. We use graphs to denote causal relationships between random 
variables Xi, . . . , X p . To keep notation simple, we always assume V = {1, 2, . . . , p} =: \p], in order 
to represent each random variable by its index in the graph. 

An edge (a, b) G E with (6, a) G E is called undirected (or a line), whereas an edge (a, b) G E 
with (b, a) £ E is called directed (or an arrow). Consequently, a graph G is called directed (or 
undirected, resp.) if all its edges are directed (or undirected, resp.); a directed graph is also called 
digraph for short. We use the short-hand notation 

a^beG :«4> (a,b) G E A (b,a) E, 
a — beG (a,b) G E A (6, a) G E, 

a &GG :<S> (a,b) G EV (b,a) G E. 

A subgraph of some graph G is a graph G' = (V',E') with the property V C V, E' C E, 
denoted by G' C G. For a subset A C V of the vertices of G, the induced subgraph on A is 
G[A] := (A, E \A]), where -E[^4] := E n (A x A). A v-structure (also called immorality by, for 
example, Lauritzen, 19961 ) is an induced subgraph of G of the form a — > 6 < — c. The skeleton 



of a graph G is the undirected graph G u := (V,E U ), E u := {(a, b) e V xV \ a b G G}. For 

two graphs G\ = (V,Ei) and G2 = (V,^) on the same vertex set, we define the union and the 
intersection as G\ U G2 := (V,Ei U E 2 ) and G\ n G2 := (V, #1 fl £2), respectively. For a graph 
G = (V,-E) and (a, 6) G #*(V), we use the shorthand notation G - (a,b) := (V,E \ {(a, 6)}) and 
G + (a,b):=(V,EU{(a,b)}). 

The following sets describe the local environment of a vertex a in a graph G: 



pa G (a) 
ch G (a) 
ne G (a) 
ad G (a) 



{b G V I b — > a G G}, the parents of a, 

{& G V I a — > 6 G G}, the children of a, 

{6 G V I a — 6 G G}, the neighbors of a, 

{b G V I a 5 G G}, the vertices adjacent to a. 



The subscripts "G" in the above definitions are omitted when it is clear which graph is meant. 
For a set A C V of vertices, we generalize those definitions as follows: 

pa G (A) := |J pa G (a) \ A, ne G (A) := (J ne G (a) \ A, etc. 

a£A a£A 

The degree of a vertex a £ V is defined as deg G (a) := | ad G (a)|. 

For two distinct vertices a and b G V, a chain of length k from a to 6 is a sequence of distinct 
vertices 7 = (a = ao, ai, . . . ,a& = 6) such that for each i = 1, . . . , fc, either aj_i — > a, G G or 
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5 ► 6 7 



Figure 16: A chain graph G with three chain components A = Tq(X) = [1]« G = {1)2,3,5}, 
B = Tq(6) = {6} and C = ?g(4) = {4, 7}. The arrows induce the partial order A <g B, A <g C. 
The graph is no chain graph anymore when we replace the arrow 3 — > 4 by a line since this would 
create a directed cycle: (3, 7, 4, 3). 



a,_i < — a,i G G; if for all i, (flSi-i, a{) G E (that is, aj_i — > a, G G or aj_i — aj € G), the sequence 
7 is called a path. If at least one edge aj_i — > dj is directed in a path, the path is called directed, 
otherwise undirected. A (directed) cycle is defined as a (directed) path with the difference that 
ao = a n . Paths define a preorder on the vertices of a graph: a <g b 3 & path 7 from a to 6 in 
G. Furthermore, a «g 6 :<^ (a 6) A (b <q 0) is an equivalence relation on the set of vertices. 

An undirected graph G = (V, E) is complete if all pairs of vertices are adjacent. A clique is a 
subset of vertices C C V such that G[C] is complete; a vertex a G V is called simplicial if ne(a) is 
a clique. An undirected graph G is called chordal if every cycle of length k > 4 contains a chord, 
that means two nonconsecutive adjacent vertices. For pairwise disjoint subsets A,B,S C V with 
A ^ and i? ^ 0, A and -B are separated by S in G if every path from a vertex in A to a vertex 
in B contains a vertex in S. 

A directed acyclic graph, or DAG for short, is a digraph that contains no cycle. In the 
paper, we mostly use the symbol D for DAGs, whereas arbitrary graphs are, as in this appendix, 
mostly named G. Chain graphs can be viewed as something between undirected graphs and DAGs: 
a graph G = (V, E) is a chain graph if it contains no directed cycle; undirected graphs and DAGs 
are special cases of chain graphs. The equivalence classes in V w.r.t. the equivalence relation ~c 
are the connected components of G after removing all directed edges. We denote the quotient set 
of V by T(G) := Vj ~g> and its members T G T(G) are called chain components of G. For a 
vertex a G V, Tq{o) stands for [a]^ G - The preorder on V induces in a canonical way a partial 
order on T(G) which we also denote by ^g- Tg(gl) Tq(Jo) :44> a ^g b. An illustration is shown 
in Figure [16j 

An ordering of a graph is a bijection [p] — > V, hence, since we assume V = [p] here, a 
permutation a G S p . An ordering a canonically induces a total order on V by the definition 
a <„ b (T _1 (a) < <7~ 1 (6). An ordering a = (vi,...,v p ) is called a perfect elimination 
ordering if for all i, Vi is simplicial in G u [{vi, ...,«$}]. A graph G = (V, E) is a DAG if and only 
if the previou sly defined preor der <g is a partial order; such a partial order can be extended to 
a total order ( Szpilrain . 1930l ). Thus every DAG has at least one topological ordering, that 
is an ordering a whose total order < CT extends <g'- a b a < a b. For a G S p , a DAG 
D = ([p],E) is said to be oriented according to a if a is a topological ordering of D. In a 
DAG D with topological ordering a, the arrows point from vertices with low to vertices with high 
ordered indices. The vertex o~(l) is a source, that means all arrows point away from it. 



A. 2 Perfect Elimination Orderings 

Perfect elimination orderings play an important role in the characterization of interventional 
Markov equivalence classes of DAGs as well as in the implementation of the Greedy Interven- 
tional Equivalence Search (GIES). In this section, we provide all results for this topic that are 
used as auxiliary tools in the proofs of Sections [3] and HI 

Lemma 37 Let D = (V, E) be a DAG. D has no v-structures if and only if any topological ordering 
of D is a perfect elimination ordering. 



30 



Input : An undirected graph G = (V, E) 

Output: An ordering a of the vertices V, called a LExBFS-ordering 

£ <— (V); // Initialize sequence E of vertex sets to contain the single set V in the beginning 
a <— (); // Initialize output sequence of vertices 
while £ ^ do 

Remove a vertex a from the first set in the sequence £; 
if first set o/E is empty then remove first set from E; 
Append a to er; 

Mark all sets of E as not visited; 
foreach b G nec(a) s.t. b G S for some S G E do 
if S not visited then 

Insert empty set T into E in front of 5; 
Mark S as visited; 

else let T be the set preceding S in E; 
Move b from S to T; 
if S = then remove S 1 from E; 



Algorithm 6: Lex BFS(V, £7) . Lexicograph i c bre adth-first search in the so-called 
tioning paradigm" (|Rose et all Il976l : ICorneill . 120041 ) 



'parti- 



The proof of this lemma follows easily from the definitions of a v-structure and a perfect elimination 
ordering. Moreover, if any topological ordering of a DAG is a perfect elimination ordering, this is 
automatically the case for every topological ordering. 



Proposition 38 (Rosel . Il970h Let G = (V, E) be an undirected graph. Then G is chordal if and 



only if it has a perfect elimination ordering. 

Perfect elimination orderings of chordal graphs can be produced by a variant of the breadth- 
first search algorithm, the so-called lexicographic breadth-first search (LexBFS; see Algorithm [6]) . 
The term "lexicographic" reflects the fact that the algorithm visits edges in lexicographic order 
w.r.t. the produced ordering a. 



Proposition 39 (|Rose et all 1 19761 ) Let G = (V,E) be an undirected chordal graph with a 



LexBFS -ordering a. Then a is also a perfect elimination ordering on G. 

Corollary 40 Let G be an undirected chordal graph with a LexBFS -ordering a. A DAG D C G 
with D u = G that is oriented according to a has no v-structures. 

Corollary 00] is a consequence of Lemma [33 and Proposition [3SJ According to this corollary, 
LExBFS-orderings can be used for constructing representatives of essential graphs (see Proposition 
[T6]) . Corollary 00] as well as Algorithm [6] are therefore of great importance for the proofs and 
algorithms of Sections [3] and [H 

Figure [T7] shows an undirected chordal graph G and a DAG D that has the skeleton G and is 
oriented according to a LExBFS-ordering a of G. The functioning of Algorithm [6] when producing 
a LExBFS-ordering on G is illustrated in Table [TJ Note that the "sets" in E are written as tuples. 
We use this notation to ensure that we can always remove the first (leftmost) vertex from the first 
"set" of S (line H] in Algorithm [6|) , and that we keep the relative order of vertices when moving 
them from one set S to the preceding one, T, in S (line Q3] in Algorithm [DJ) . Throughout the text, 
we always assume an implementation of Algorithm [6] in which the data structure used to represent 
the "sets" in the sequence S guarantees this "first in, first out" (FIFO) behavior. In particular, the 
start sequence (t>i, v%, . . . ,v p ) of the vertices in V provided to the algorithm determines the vertex 
the LExBFS-ordering a := LexBFS((i>i, . . . ,v p ),E) starts with: o~(l) = V\. It is often sufficient 
to specify the start order of LexBFS up to arbitrary orderings of some subsets of vertices. For a 
set A = {ai, . . . , afc} C V and an additional vertex v € V \ A, for example, we use the notation 

LexBFS({A,v,V\(A(J{v})),E), or even LexBFS((j4, «,...), E) 
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5 6 7 

(a) G 



5 < 6 7 

(b) D 



Figure 17: An undirected, chordal graph G = ([7], E) and the DAG D we get by orienting all edges 
of G according to the ordering a := LexBFS((6, 3, 1, 2, 4, 5, 7), E). 
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Table 1: State of the sequences S and a after the i th run (z = 0, . . . , 7) of the while loop (lines [3] 
to [T4"|) of Algorithm E] applied to the graph G of Figure [T71 with start order (6,3,1,2,4,5,7). 



to denote a LExBFS-ordering produced from a start order of the form (oi, . . . , o^, v, . . .), without 
specifying the orderings of A and V\(A\J {v}). 

By using appropriate data structures (for example, doubly linked lists for the representation 
of £ and its sets, and a pointer at each v ertex pointing to the set in £ in which it is contained), 
Algorithm [6] has complexity 0(\E\ + |V|) (jCorneil liooi ) . 

For the rest of this section, we state further consequences of Lemma [37] and Proposition [39] 
which are relevant for the proofs of Sections and [H 



Corollary 41 Let G = (V,E) be an undirected chordal graph, and let a — b G G. There exist 
DAGs D\ and D2 with D\, D2 C G and = DV; = G without v-structures such that a — > b G D\ 
and a< — b G D2. 

Proof Set a x := LExBFS((a, V \ {a}),E) and a 2 := LexBFS((6,F \ {b}),E), and let D t and 
D2 be two DAGs with skeleton G and oriented according to o~\ and 02, resp. Then, by Corollary 
[4*0l Di and D2 have the requested properties; in particular, all edges point away from a in D\, 
whereas all edges point away from b in D 2 - M 



Corollary 42 (jAndersson et all 119971 ) Let G = (V, E) be an undirected chordal graph, a G V 
and C C ne(a). Then there is a DAG D C G with D u = G and {b G ne(a) | b —> a G D} = C that 
has no v-structures if and only if C is a clique. 

Proof "=>": Assume that there are non-adjacent vertices b,c G C. Then, b — a — c is an induced 
subgraph of G, and by construction, the same vertices occur in configuration b — > a < — c in D, 
which means that D has a v-structure, a contradiction. 

"<^=": Let (ci, . . . ,Cfc) be an arbitrary ordering of C. Run LexBFS on a start order of the form 
(ci, . . . , Cfe, a, . . .). After the first run of the while loop (lines [3] to [H] of Algorithm [6]) , a = (ci), 
and the first set in the sequence S contains (C U {a}) \ {ci} as a subset (all vertices in this set are 
adjacent to ci), in an unchanged order c 2 , ■ ■ ■ ,Ck,a due to our FIFO convention. After the second 
run of the while loop, a = (c\, C2), and the first set in X contains (C U {a}) \ {ci, C2}, and so on. 
In the end, we get a LEXBFS-ordering of the form a = (c%, . . . , c^, a, . . .). Orienting the edges of 
G according to a yields a DAG with the requested properties by Corollary [40] ■ 
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Figure 18: Configuration of vertices in Proposition 1431 



Proposition 43 LetG = (V,E) be an undirected, chordal graph, a — b G G, and C C nec(a)\{fr} 
a clique. Let N := nec(a) nnec(^), and assume that CdN separates C\N and N\C in G[neo(a)} 
(see Fiaure [TS\) . Then there exists a DAG D C G with D u = G such that 

(i) D has no v- structures; 

(ii) all edges in D[C U {a}} point towards a; 

(Hi) all other edges of D point away from vertices in C U {a} (in particular, a — >b £ D); 
(iv) b — > d G -D for all d G P := ne G (6) \ (CU {a}). 

Proof Set a := LexBFS((C, a, b, . . .), E), and let D be the DAG that we get by orienting the 
edges of G according to a. As in Corollary | 



properties (i) 
y d G P i: 
W.l.o.g., we can assume C 



to 



ni 



are met. 



It remains to show that b occurs before any d G P in a (that means b < CT d V d G P) in order 

= {1, 2, . . . , k}, a = k + 1 and b = k + 2. 
The start order of the vertices for LexBFS is then (1, 2, . . . ,p). Due to the FIFO convention for 
the sets of the sequence E in Algorithm [6l b always precedes any d G P whenever they appear 
in the same set; hence we only must show that the set containing b is never preceded by a set 
containing some d G P in E. 

Suppose, for the sake of contradiction, that this is the case for some d G P; name v% := d. At 
the beginning, b is in the same set as v\ in the sequence S; there is some vertex vi that forces 
LexBFS to move v\ into the set preceding the one containing b. A careful inspection of Algorithm 
[6] shows that v-i is the vertex which is minimal w.r.t. < a in 



S(yi) := {v G V | v G ne G (vi) \ ne G (b),v < a b}. 



If V2 > b (that is, if i>2 ^ C U {a} due to our convention), V2, as v%, always follows b whenever they 
are in the same set in E. Therefore, V2 < a b implies that there is some vertex V3 that moves V2 in 
the set preceding the one of b in E during the execution of LexBFS; as before, we see that this is 
the vertex which is minimal w.r.t. < CT in S(v2). 

We can now continue to construct this sequence Uj+i := mm S(vi) (always taking the minimum 
w.r.t. < CT ) until we find some vertex v m < b; this is a vertex in C U {a}. Even more, v m G C \ N , 
since, by definition of S(t> m _i), we only consider vertices that are not adjacent to b. We now have 
constructed a path 7 = (vi,..., v m ) of length m > 2 in G such that V\ G P, v » ^ nee (b) V i > 1, 
Vi > b V i < m and v m G C \ N; furthermore, we have v m < a . . . < a v\ < a b. The path 7 can be 
elongated to a cycle (a, vq := b, v\, V2, ■ ■ ■ , v m , a): 




v m -i 



We now claim that Vi — a G G for all < i < m. This is clearly the case for i = and i = m by 
construction. Assume, for the sake of contradiction, that there is some i, < i < m, that is not 
adjacent to a. Let r be the largest index smaller than i such that v r — a G G and s be the smallest 
index larger than i such that v s — a G G. Then the following is an induced subgraph of G: 
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a 




'r 



Note that a chord between different v^s, say, a chord of the form vi — vi+h with h > 2, would 
violate the minimality of vi + \ in the set S(yi). This means that G contains an induced cycle of 
length 4 or more, contradicting the chordality. 

This proves the claim that Vi — a 6 G for all < i < m, or, in other words, Vi € nec(a) for 
all < i < m. Hence v\ € N \ C, and 7 is a path from N \ C to C \ N in G[nec(a)] that has no 
vertex in C n N, in contradiction with the assumption. ■ 



Proposition 44 Let G = (V, E) be a chain graph with chordal chain components that does not 
contain a — > b — c as an induced subgraph, and let D C G be a digraph with D u = G u . D is 
acyclic and has the same v- structures as G if and only if D[T] is oriented according to a perfect 
elimination ordering for each chain component T € T(G). 

Proof "=>■": let T € T(G). G[T] obviously does not have any v-structures, hence D[T] has no 
v-structures, either. It follows from Lemma [371 that D[T] must be oriented according to a perfect 
elimination ordering. 

"<=": for each T G T(G), D[T] is acyclic by construction. Assume that D has some directed cycle 
7; this cycle must reach different chain components of G, so it contains at least one edge a — >6 that 
is also present in G. Because of D C G and D u = G", 7 is also a cycle in G; and since a — > 6 € G, 
it is even a directed cycle in G, a contradiction. So D is acyclic. 

By construction, every v-structure in G is also present in D. Suppose that D has some v- 
structure a — > b < — c that G has not. a, 6 and c cannot belong to the same chain component of G 
according to Lemma [371 So, w.l.o.g., a — >6 — c must be an induced subgraph of G, contradicting 
the assumption. Hence D and G have the same v-structures. ■ 



B Proofs 

In this appendix, the technically interested reader finds all proofs that were left out in Sections [2] 
to H] for better readability. 

B.l Proofs for Section H 

We start with the proof of Lemma [5] which motivates Definition [7J by showing that, for some DAG 
D and some (conservative) family of targets I, the elements of M%(D) are exactly the density 
tuples that can be realized as interventional densities of a causal model with structure D. Note 



that we use the conservativeness of X only in the proof of point (ii) ; it can even be proven without 
assuming conservativeness, although the proof becomes harder. 
Proof of Lemma [8] 

(i) f(x\ do(X/ = Uj)) obeys the Markov property of (Section l2.ip . Furthermore, for I, J £ I 
and a £ I U J, we have 

f( x a I x paD(a) ;do(X/ = Uj)) = f(x a j x p&o{a) ) = f(x a \ x pai?(a) ; do(Xj = Uj)) 

by the truncated factorization of Equation (pQ) . 
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(ii) Let a G [p\. Since I is conservative, there is some I G Z such that a ^ I. Define 
x ai x pa, D (a)) :_ / i x a\ x pa, D (a))- Note that, due to Definition [71 the function h a does 
not depend on the choice of I. 
Let f(x) := ria=iM 

x ai x pa D (a) )i this is a positive density on X with f( x a\ x pa, D (a)) — 
h a ( x a> x pa , D (a))] hence / G M(D) and (D,f) is a causal model. 

By defining level densities fi(xi) := Ilie// i x i)i we can construct an intervention setting 
S := {(J, with the requested properties. ■ 

The proof of the main result of Section [21 the graph theoretic criterion for two DAGs being 
interventionally Markov equivalent (Theorem I10p . requires additional lemmas. 

Lemma 45 Let D be a DAG, I a family of targets and I G I a target in this family. Define 

M^\D) := {/« | (f^)j e x G Mi(D)} , 

the projection of Aix(D) to the density component associated with the intervention target I. Then, 
MW(D)=M(dM). 

Proof The inclusion "c" is immediately clear from Definition [7J It remains to show "D" . 

Let / G Ai(D^). Since D^ 1 ' C D, / also obeys the Markov property of D; this means 
/ G M(D). Set fi(xj) := f(xi); since / G M(D^), the components of // are independent. For 
J £ I, J / /, let fj be an arbitrary level density on Xj. By Lemma [ ^i)[ we know that, for 
intervention variables Uj ~ fj (J G I), 

(/(• | do D (Xj = [/,))) JeJ G.Mx(D) , 

hence /(• | do£>(X/ = Ui)) G .M^(-D) by definition of M.^'{D). Moreover, by construction of //, 
we have f(x | doi)(X/ = Ui)) = f(x) and hence / G M^\D). ■ 



Lemma 46 Let D be a DAG, f G M(D), and Ac [p\. Then, 

I| f( x a I ^pa(a)) = f{ x A | x p a (A))- 
aGA 

Proof Let a G <Sp be a topological ordering of .D. Then, for a G A, 

pa(a) C pa(A) U [A n <7 _1 ({1, . . . , a - 1})] (8) 

holds: every 6 G pa(a) either lies in A c and hence in pa(^4) by the definition given in Appendix 
IA.ll or in A n ct _1 ({1, . . . , a — 1}) by the definition of a topological ordering. 
Hence we conclude 

f( x A | £pa(A)) = n f( x a \ ^An - 1 ({l,-,a-l})> x pa.(A)) = /(»o I x pa.(a))] 
aGA aGA 

the first equality is the usual factorization of a density, the second equality follows from the Markov 
properties of / and Equation flHJ). ■ 



Lemma 47 Let X be a family of targets. Assume D\ and D 2 are DAGs with the same skeleton 
and the same v-structures such that D± and D% have the same skeleton for all L G T. Moreover, 
let a — >b G D\. If there is some I (El such that \I n {a, b}\ = 1, then the arrow is also present in 
D 2 : a^b G D 2 . 
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Proof Since D\ and D 2 have the same skeleton, we have at least a b G D 2 . Suppose a< — 6 £ D 2 . 

If a E J, b ^ I, a and 6 are adjacent in Z?j , but not in Z)^, hence and have a different 
skeleton, a contradiction. On the other hand, if a £ I but b € I, a and 6 are not adjacent in D± , 
but in D 2 \ a contradiction, too. ■ 



Proof of Theorem [10] f^j =>• Let I G X, and let Al (/) (£>i) and 7W (/) (D 2 ) be defined as 



in Lemma H5l By Definition [9] of interventional Markov equivalence, it follows that M^\D{) 
M^(D 2 ); hence M(D^ ] ) = M(D { 2 I] ) by Lemma [ 



(a) =4 


> (Hi): 


(Hi) = 


4> (iv) 



(Hi): this implication follows from Theorem 



st a — > 6 E -Di be an arrow. Since X is conservative, there is some I G X such that 

b £ I. For this /, a — > 6 G Z)[ , so a 6 G by assumption and hence a b G D2 because of 

D { 2 1] C D 2 . Similarly, we can show the implication a — > b G D 2 =>■ a & G Di , what proves that 

D\ and Z?2 have the same skeleton. 

It remains to show that D\ and D 2 also have the same v-structures. Let a — > b < — c be a 
v-structure of D\ . There is some Z G X that does not contain b; a — > b < — c is then an induced 
subgraph of D± an d hence by assumption also of D 2 . By consequence, a — > 6 < — c is also an 
induced subgraph of Z>2 since Z>2 has the same skeleton as D% . The argument is of course symmetric 
w.r.t. exchanging D\ and D 2 . 



JtiiJ\^l(i)\- Let {f {I) )i e i G MxiPx). By Lemma Bp)] there is some density / G M{D{) and some 



intervention setting S = {(I, /j)}jex such that /^(-) = /(-| do/^pT/ = L/j)) for random variables 
*7/~//,ZgX. 

The truncated factorization in Equation (pQ) tells us 

/(* 1 d ODl (Xj = uj)) = n f( Xa 1 x p (a) ) n fdx a ) = ax) n — - 



7(^/ I Zpa^tf)) 



/(*) „ /f(X/) v (9) 



The last step uses Lemma [ 

We now claim that pa Dl (I) = pa£) 2 (I). Indeed, if b G I and a G pa 5l (6) \ 7, a — * b is an arrow 
in D\ with |Z n {a, b}\ = 1, hence a — > b G -D2 by Lemma H71 and therefore a G pa£> 2 (Z); the argu- 
ment is symmetric w.r.t. exchanging D\ and D 2 . It follows that /(x/|x pa£) (/j) = /(x/|aj pa£| (/)), 
and by repeating the calculation in ([9]) for D 2 instead of D\, we find f(x\doD 1 (Xi = Uj)) = 
f{x\do r>2 {X I = U I )). 

Since this equality is true for all Z G X, we have / (/) (-) = /(-| doz) 2 (X/ = 17/)) for all I G X, 
so (f^)iez G Mx(D 2 ) by Lemma H^I)j which proves A^x(Z?i) C Mx(D 2 ). The other direction is 
completely analogous. ■ 



Points (i) to (iii) are even equivalent under non-conservative families of targets. The proof is 



more difficult in this case though. 



B.2 Proofs for Section H 



All state ments of Section [3721 are similar to analogous statements for the observational case devel- 



oped by lAndersson et al.l ([19971 ) . Some of the proofs given there are even literally valid also for 



our interventional setting; in such cases, we will not repeat them here, but just refer to the original 
ones. However, in most cases, the generalization from the observational to the interventional case 
is not obvious and requires adapted techniques presented in this section. Here, X always stands 
for a conservative family of targets. 
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First, we show that for some DAG D, £i(D) is a chain graph (Proposition ll5p . For that purpose, 
we define £x(D)* as the smallest chain graph containing £x(D). £x{D)* is obt ained from £t(D) 



by co nverting all arrows that are part of a directed cycle in £x(D) into lines ([Andersson et al 
19971 ) . We first state a couple of properties of £x(D) and £x(D)* (Lemma I48p. and then show that 
ExiP)* = £x(D) (Proposition d5]). 



Lemma 48 (adapted from Andersson et al. . 19971 ) Let D be a DAG. Then 



(i) £x{D) has no induced subgraph of the form a - 

(ii) If £x(D) has an induced subgraph of the form 



c. 



then there exist D±,D 2 G \P\x such that 



b CDi, 



b CD 2 . 



(Hi) £x{D)* has the same v-structures as D (and hence as £x{D)). 

(iv) £x{D) o-nd £x(D)* do not have any undirected chordless k-cycle of length k > 4. 

(v) £x(D)* has no induced subgraph of the form a — > b — c. 

(vi) If two vertices a and b are adjacent in £x(D)* and there is some I £ X such that \In {a, b}\ = 
1, then the edge between a and b is directed in £x{D) o,nd £x(D)* . 

Proof Points (i) to (v) correspond to Facts 1 to 5 of Andersson et al. ( 19971 ) where these properties 
were proven for observational essential graphs. A thorough inspection of the proofs given there 
reveals that they only make use of the fact that two Markov equivalent DAGs have the same 
skeleton and the same v-struc tures, which is a l so tru e in the interventional case by Theorem 1 1 01 
Thanks to this, the proofs of lAndersson et al.1 (|1997l ) can be literally used here. (Note that the 
inverse implication also holds in the observational case, but not in the interventional one; see the 
discussion after Theorem 1 1U1) 

The edge between a and b in £x(D) is directed since the arrow 
It remains to show that the edge is also 



vi 



It remains to prove point 
between a and b is X-essential in D by Corollary [T3J 
directed in £x(D)* , that is, to show that it is not part of a directed cycle in £x(D). 

Let's suppose, for the sake of contradiction, that the edge between a and b is part of a directed 
cycle 7 = (a, b = bo, b\, . . . , = a) in £x(D). W.l.o.g., we can assume that a — > b € £x(D), and 
that 7 is the shortest such cycle containing a directed edge with one end point in / and the other 
one outside I. 

Case 1: k = 2. Then 7 is of the form 



since two or three directed edges would imply the existence of a digraph with a cycle in the 

D 2 in [D] 2 

b CD 2 . 



equivalence class of D. By point (ii) , there are DAGs D\ and D 2 in [D]x such that 

a > b C D\ , a 



The condition \I n {a, b}\ = 1 leaves four possibilities: 



a) a £ I; b, b± ^ X then, a 



b CD 



1 ' 



b CD 



2 ' 
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b) a, bi G J; b £ I: then, a > b C d[ I] , a > 6 C D$p ; 

c) b e I; a, b 1 (£ I: then, a 6 C £>f } a 6 C ; 

X 6 X 61 

d) 6, 61 G I; a ^ J: then, a 6 C L>J J) a 6 C £^ . 
In all four cases, (Df)" / {D^Y, hence D : ^ 2 L> 2 , a contradiction. 

Case k > 3. Let i be the smallest index such that bi — frj+i G £x(D) (there must be such an 
index, otherwise 7 would be a directed cycle in D). 



we must 



Case 2.1: i = 0. Since a — >b — b\ cannot be an induced subgraph of £x(D) by point 

have a b\ G £x(D). More precisely, we must have a — > b\ G £x(D), otherwise (a, b, bi, a) would 

form a shorter directed cycle than 7, in contradiction to the assumption. This means that there 
exist DAGs Di,D 2 G [Z% such that 

a > b C Di, a > b C D 2 . 



h 61 
Again, the condition \I n {a, 6}| = 1 leaves four possibilities: 

a) a el; b, bi <£ I: then, a- — > 6 C L>f ) , a - — > 6 C 



2 ' 



6) a,hel;b^ I: then, a > 6 C £>J J) , a —3 6 C D { 2 1] ; 

c) be I; a, b 1 I: then, a 6 C a b C ; 

d) b, bi G I; a $ I: then, a 6 C d[ I] a b C . 

Cases and c) are not compatible with the condition (d[^) u = (D^) u . In cases a) and d), the 
arrow a — ► b\ is part of a directed cycle (a, 61, b 2 , ■ ■ ■ , i>& = a), furthermore |J n {a, = 1; this 
contradicts the assumption of minimality of the larger cycle 7. 

Case 2.2: i > 1. Since — > bi — bi + \ cannot be an induced subgraph of £x(D), we must have 
bi-i bi + \ G £x(D). Either bi-i < — bi + \ G £x(D), that is 

> h C 5i(£>) , 

h+i 

which would imply the existence of a digraph with a directed 3-cycle in the equivalence class of 
D, a contradiction. The other cases are — > G £xiP) or — bi+i e £x{D) which would 
mean that a — > b would be part of a shorter directed cycle (a, b = bo, . . . , . . . , = a), 

contradicting the assumption of minimality of the cycle 7. ■ 
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Proof of Proposition 1151 We only prove the first point; the second one is an immediate conse- 
quence of Lemma 111 Fiv) We have to show that £x(D) = £j(D)* , that means that 



a — b££x(D)* => a — b(££ x (D). 



By Lemma HHiv) all chain components of £%{D)* are chordal. Let D\ and D2 be two DAGs that 



are obtained by orienting all chain components of £z(D)* according to some perfect elimination 
ordering, such that a — >b £ D\ and a < — b £ D2; such orientations exist by Proposition 05] and 
Corollary SJJ 

We now claim that D\ ~j D2 by verifying the criteria of Theorem [1C pvj it then follows that 
a — be £ X {D) because of D 1 UD 2 <Z £x(D): 

• By Proposition 1441 D\ and D2 have the same skeleton and the same v-structures. 

• Dip and dP have the same skeleton for all I £ X: suppose, for the sake of contradiction, 



that (dP) u has some edge c — d that {D^'Y has not. W.l.o.g., we then have c — >d € Di, 
c < — d £ D2, c € I, d £ I. But then c and d are adjacent in £x(D)* with |J n {c, d}| = 1, 
hence the edge between c and d must be oriented in £%{D)* by point (vi) of Lemma |4"5[ and 



hence it is not possible that this edge has two different orientations in D\ and D2 by their 
construction. H 



Proof of Proposition [TBI "<=": By the construction of £x(D), we know that D C £x(D) and 
D u = £x(D) u . Furthermore, D has the same v-structures as £x{D). Let D' be another digraph 
that is obtained by orienting all chain components of £x(D) according to a perfect elimination 
ordering; by Proposition EHl D' is acyclic and has the same v-structures as £x(D) and hence as 
D. It remains to show that D^ and JD'W have the same skeleton for all J £ X; this can be done 
similarly to the proof of Proposition [151 

"=$■": let D' be a DAG with D' ~i -D- In particular, D' and D have the same skeleton and the 
same v-structures, so D' also has the same skeleton and the same v-structures as £x{D). It follows, 
with Proposition |4"4"1 that D' is oriented according to a perfect elimination ordering on all chain 
components of £x(D). ■ 



Lemma 49 Let D be a DAG and a — > b an Z-essential arrow in D. Then a — > b is strongly 
I-protected in £%{D). 

This lemma is an auxiliary result needed to prove Theorem [T5J In its proof, we first show the 
weaker statement that every Z-essential arrow of D is X-protected in £x(D). 

Definition 50 (Protection) Let G be a graph. An arrow a — > b £ G is X-protected in G if 

there is some intervention target I El such that \I n {a, b}\ = 1, or pa G (a) 7^ pa G (6) \ {a}. 

This definition is again a generalization of the notion of protection of Andersson et al. 

for I = {0}, we gain back their definition. A strongly X-protected arrow (Definition I14p is also 

X-protected. In a chain graph G, an arrow a — > b is X-protected if and only if there is some I £ X 

such that I J n {a, b}\ = 1, or the arrow a — > 6 occurs in at least one subgr aph of the form (a), fb) , 

(c) in the notation of Definition [TH or in a subgraph of the form (d') ( Andersson et all 19971 ). 

where 

(d') : a { b . 



Proof of Lemma 1491 As fo resha dowed, we prove this lemma in two steps, corresponding to Facts 



6 and 7 of Andersson et al. I (jl997ft : in a first step, we show that a — > b must be X-protected, in a 
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second step, we strengthen the result by showing that it must even be strongly X-protected. For 
notational convenience, we ab breviate G := £t ( D)- W e skip some steps of the proof that can be 
literally copied from proofs in Andersson et al. ( 19971 ). 



Suppose, for the sake of contradiction, that a — >6 is not X-protected. Let D\ be a digraph that 
is gained by orienting all chain components of G according to a perfect elimination ordering, where 
the edges of Tq(o) and Ta(b) are oriented such that all edges point away from a or b, respectively. 
Then D\ is acyclic and X-equivalent to D by Proposition [161 

Let D2 be another digraph, differing from only by the orienta tion of the edge between a 



and b. It can be shown that D2 is acyclic too (jAndersson et all 119971 . proof of Fact 6). We now 
claim that D\ ~j Di'- 

• D\ and D2 clearly have the same skeleton. 

• D\ and D2 have the same v-structures. Otherwise, there would be some v-structure c — >a< — b 
in L>2, or some v-structure a — > b < — c in D\. In both cases, this would imply pa G (a) 7^ 
pa G (6) \ {a}, contradicting the assumption: in the first case, c ^ X G (a) by construction (all 
edges of To (a) point away from a in D2), so c G pa G (a), but c £ pa G (6); in the second case, 
c G pa G (6), but c ^ pa G (a) by analogous arguments. 

• (d[ I] ) u = (D^Y for all I G X. Otherwise, there would be some I € X such that the skeletons 

of D^p and differ in the edge between a and b. This could only happen if |/n {a, b}\ = 1, 
in contradiction with the assumption. 

Hence, since D±, D2 G [D]x, we have D\ U D2 C G and thus a — b G G, a contradiction. This 
proves that a — > 6 is X-protected in G. 

In the second step, we show that a — > 6 is even strongly X-protected. If this was not the case, 
a — >6 would occur in configuration (d') in G, but not in configuration (a), (b), (c) or (d) (see the 
comment following Definition 1501). Define P n := {d G X G (a) | d — >6 G G}. It can be shown that 



P a is a clique G[T G (a)] (jAndersson et all 119971 . proof of Fact 7). 

Let D\ be the DAG that we get by orienting all chain components of G according to a perfect 
elimination ordering, such that, additionally, 

• all edges of Di[Tc(b)] point away from b, 

• all edges of Z?i[P a ] point towards a, 

• and all other edges of Di[T G (a)] point away from a. 

Such an orientation exists by Corollary S2J Let D2 be the digraph that we get by changing the 
orientation of the edge a — ► b in D\\ as in the first part, it can be shown that D2 is acyclic 



(jAndersson et all 1 19971 . proof of Fact 7). Again, we claim that D\ ~j D2: 



• Di and D2 clearly have the same skeleton. 

• D\ and D2 have the same v-structures. Otherwise, there would be some v-structure d — >a< — b 
in D2, or a v-structure a — > b < — d in D\. In the first case, d £ P a (otherwise, d — > b G G by 
definition of P a , and hence d — > b G D2 since ^ C G), and d Tq{o) \ P a by construction 
(edges in T G (a) \ P a point away from a in D2), hence d — >a G G and a — > b is in configuration 
(a) in G; in the second case, d G" X G (6) by construction (all edges of T G (£>) point away from 6 
in so a — > b is in configuration (b) (notation of Definition I14p in G. Both cases contradict 
the assumption. 

• Exactly as in the first part, (d[^) u = {D^ ] ) u for all I G X. 

We can conclude that, since D\,D2 G [D]x, P>\ U D2 C G, so a — 6 G G, a contradiction. ■ 
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Proof of Theorem 



(i) and (ii) follow from Proposition [T5l (iii) from Lemma H^v) 



(iv) from Corollary [13] and (v) from Lemma 



Consider the set D(G) of all DAGs that can be obtained by orienting the chain components 
of G according to a perfect elimination ordering; we have |JD(G) C G. On the other hand, for 
each undirected edge a — b £ G, there are DAGs D\ and D 2 in D(G) such that a — > b £ D\, 
a^b£D 2 (Corollary glj, hence G C U D(G). Together, we find G = \J D(G). 
We claim that D 1 ~j D 2 for any two DAGs D 1 ,D 2 € D(G): 

• D\ and D 2 have the same skeleton and the same v-structures by Proposition 1441 

• (d[^) u = f° r a h I El. Otherwise, there would be arrows a — > b £ D\, a< — b £ D 2 , 
and some I £ I such that \I n {a,&}| = 1; this would mean that a — b £ G although 



|/n {a, b}\ = 1, contradicting property (iv) 



Let D £ D(G). We have shown that D(G) C [D] x , hence G = U D ( G< ) C £r(-D). It remains to 
show that G D £i(D). 

Assume, for the sake of contradiction, that G has some arrow a — > b where £x(D) has an 
undirected edge a — b. According to property (v) 
was some I £ I such that \I R 



a — s- 6 is strongly Z-protected in G. If there 



1, the edge between a and b was also directed in £z(D) 
by Corollary 1 13t a contradiction. Hence a — >6 occurs in G in one of the config urations depicted in 
Definition 1141 Exactly as in the proof of Theorem 4.1 of Andersson et al. ( 19971 ). we can construct 
a contradiction for each of the four configurations. Although the proof given there can be used 
literally, we reproduce it here since since we will use the following steps again in the proof of 
Lemma [22l 

We assume w.l.o.g. that Tb(a) is minimal in 

A := {T £ T(G)| 3 a £ T,b £ V(G) :a^b£G,a — b£ £ X (D)} 

w.r.t. ^q, and that Tc{b) is minimal in 

B := {T £ T(G)| 3a £ T(a),b £ T : a — » b £ G, a — b £ £ X (D)}. 

Each configuration (a) to (d) of Definition [T4] leads to a contradiction (c, c\ and c 2 denote the 
vertices involved in the respective configuration): 

(a) Because of the minimality of Tq{o), c — >a must be oriented in £ X (D), hence c — > a — b is an 
induced subgraph of £ X (D), contradicting Lemma [4^i)| 

(b) a — > b < — c is then a v-structure in D, hence it is also a v-structure in £ X (D), that means 
a — > b £ £ X (D), a contradiction. 

(c) Because of the minimality of Tb(6), the edge between a and c must be oriented in £ X (D), so 
the vertices a, b and c are in one of the following configurations in £ X {D): 



Both possibilities violate Proposition ll^(i)| 

(d) The v-structure c\ — >&< — c 2 of D is also a v-structure of £ X (D), hence £ X {D) has two directed 
3-cycles (c±,b, a, ex) and (c 2 ,b, a, c 2 ), a contradiction. ■ 



Proof of Lemma 



(i) This immediately follows from Theorem HI If iii 
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(ii) Let a — > b be an arrow in £x(D); by Theorem ll£|(v)| it is strongly X-protected in £x(D). If 
there is some I & I such that \I fl {a, b}\ = 1, the arrow is by definition also strongly X- 
protected in G. Otherwise, a — > b occurs in one of the configurations (a) to (d) of Definition 
[T4"lin 8x{D). In configurations (a) to (c), the other arrows involved (a < — c; c — > b; or a — >c 
and c — >b, resp.) are also present in G, hence a — >b is strongly X-protected in G by the same 
configuration as in £z(D). 

It remains to show that if a — >b is in configuration (d) in £x(D), it is also strongly X-protected 
in G. In D, the vertices {a, c\, C2} as defined in Definition [J3] can occur in one of the following 
configurations: 

ci < — a< — C2, ci < — a — >C2, c\ — >a — >C2- 

The first and the third case are symmetric w.r.t. exchanging c\ and C2, hence we only consider 
the first two. Table [2] lists all possible configurations for the vertices {a, 01,02} in the graph 
G according to the condition D C G C £x{D). There is only one possibility for the arrow 
a — >b not to occur in one of the configurations (a) to (d) of Definition 1 14^ and hence not being 
strongly X-protected in G; however, the corresponding subgraph of {a, c%, C2}, c\ — a < — C2, 
is forbidden by Definition [19j 

(iii) According to Theorem 1 101 we have to check the following properties: 

• D\ and D2 have the same skeleton, namely Df = = G u . 

• D\ and D2 have the same v-structures: let a — > b < — c be a v-structure in D±. This 
v-structure is then also present in 8x{Di). Because of D2 C G C £x(Di), we find it also 
in G and in I?2- The argument is completely symmetric w.r.t. exchanging D\ and D2. 

• For all I el, d[ J) and have the same skeleton: assume, for the sake of contradic- 
tion, that there is some I £ I and an edge a — b that is present in (d[^) u , but not in 
(£>2 )"■ W.l.o.g., we can assume that a — *b € D\, a< — b £ D2, a € I, b £ I. Because 



of Theorem Efiv) we then have a — > b £ £x{Di) and a < — 6 6 £x{D2); however, this is 



not compatible with the requirements G C £x{D\) and G C £x{D2) 



Proof of Lemma 1211 If a — >6 € £x{D), it would be strongly X-protected by Theorem ll£|(v)[ and 
hence also strongly X-protected in G by Lemma !2C|[ii)[ contradicting the assumption. Therefore, 
a — be £ X (D) and hence DcG'c £x(D). 

Suppose that G' contains an induced subgraph of the form c — > d — e. Since G does not 
contain such an induced subgraph, it must be of the form c — > a — b or c — > b — a in G' . In both 
cases, a — >b is then strongly X-protected in G, either by configuration (a) or (b), a contradiction. ■ 



Proof of Lemma 1221 Let DcGC £j(D) be a partial X-essential graph that only has strongly 
X-protected arrows. We can literally use the second part of the proof of Theorem [18] to show 



Induced subgraph of {a, c\, 02}. ■ . 


Configuration 


. . . in D 


... in G 


of a — > 6 in G 


ci < — a < — C2 


c\ < — a < — C2 


(c) 




c\ — a < — C2 






c\< — a — C2 


(c) 




ci — a — C2 


(d) 


ci < — a — > C2 


ci < — a — > C2 


(c) 




ci — a — > C2 


(c) 




ci < — a — C2 


(c) 




ci — a — C2 


(d) 



Table 2: Possible configurations for the vertices {a, c\, 02} in the proof of Lemma [2^ii)| The labels 
in the last column refer to the configurations of Definition [J3J 
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G D £x(D); there, we only used the fact that every arrow in G is strongly Z-protected. 



B.3 Proofs for Section [4] 
Proof of Proposition 1251 "=4> ": 

(i) This claim follows from Corollary [42] 

(ii) Suppose that there is some vertex a G N \ C, that is a vertex a G N with a < — v & D. D' 
would have a directed cycle if u< — a G D, so u — >a € -D. But then, u — >a< — v is a v-structure 
in D, hence also in G, and consequently a £ nea(v), a contradiction. 

(hi) Assume that 7 = (v = ao, cti, • • ■ , cifc = u) is a shortest path from d to u in G that does 
not intersect with C. We claim that 7 is a directed path in D, which means that D' has a 
directed cycle, a contradiction. 

Suppose that the claim is wrong, and let Oj < — aj+i 6 D be the first edge of (the chain) 7 
that points away from u in D; i > 1 holds by the assumption that, in particular, a\ ^ C. 
a.;_i — > aj < — di+i cannot be an induced subgraph of D, otherwise it would also be present 

in G and hence 7 would not be a path in G. Hence aj_i a^+i G G; more precisely, 

aj—i < — aj_|_i G G (and hence also in D), otherwise there would be a shorter path from v to u 
in G than 7 that does not intersect with G. Because 7 is a path in G, aj_i, aj and aj+i can 
occur in G only in one of the following configurations: 



However, both graphs cannot be an induced subgraph of the chain graph G. 



"<= Since G is a clique in G[T G (v)), there is a DAG D G D(G) with {a G ne G (v) \ a^v G D} = C 
by Proposition and Corollary 1421 It remains to show that D' is a DAG. 

Assume, for the sake of contradiction, that D' has a directed cycle going through u — ► v. The 
return path from v to u, 7 = (w = oq, a\, . . . , = u), must come from a path in G and must 
therefore, by assumption, contain a vertex aj G G (z > 2). Since Oj — > u G D by construction, this 
means that D has a directed cycle (ao, 01, ... , Oj, ao), a contradiction. 

Uniqueness of£x(D'): Let D\,D2 G D(G) with {a G nea(v) \ a — >v G D\] = {a G nec(w) | a — >w G 
D 2 } = G, and set D[ := Di+(u, v), i = 1,2; we assume that D' 1 ,D' 2 G D+(G). To prove D' x ~x -D^ 



we have to check the following three points according to Theorem [TT (iv) 

• D[ and D' 2 obviously have the same skeleton. 

• D[ and D' 2 have the same v-structures. We already know that D\ and D2 have the same 
v-structures. Let's assume, for the sake of contradiction, that (w.l.o.g.) D[ has a v-structure 
u — * v < — a that D' 2 has not. In G, we must then have a line a — t> , hence a G nec(f). 
However, the arrow between a and v would then have the same orientation in D\ and D2 by 
construction, a contradiction. 

• For all I el, D[ {1) and D'^ have the same skeleton. If this was not the case, there would be 
some vertices a, b G \p] and some I G X such that a — >b £ D[, a< — b £ D' 2 and |/n{a, b}\ = 1. 
The arrow u — > f is part of D' x and by construction, so the arrows between a and b must 

be present in D\ and D2; however, and would then not have the same skeleton, a 
contradiction. B 



Corollary [26] is an immediate consequence of Proposition [25] and the fact that we assume the 
score function to be decomposable, so we skip the proof here. 
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Proof of Lemma 1271 Obviously, we have D' C H. To show H C £z(D'), we look at some 
edge a — b G G with a, b ^ Tg(v) and show that a — b G £%{D'). W.l.o.g., we can assume that 
a — > b G D. By Corollary 1411 there exists a D2 G D(G) that has the same orientation of edges in 
Tq(v), but an orientation of edges in Tq{cl) such that a < — 6 G L>2- By Proposition 1251 we know 
that D' 2 := -D2 + ( u : v ) is X-equivalent to D' , so in particular a — b G I?' U D 2 ^ £x{D'). 

It remains to show that o — > 6 — c does not occur as an induced subgraph of H. The inserted 
arrow u — > v cannot be part of such a subgraph, since all other edges incident to t> are oriented in 
H by construction. Since G has no such subgraph either (Theorem I18p . it could only appear in H 
through one of the newly oriented edges of Tq{v). This means that if H had an induced subgraph 
of the form a — > b — c, the corresponding vertices would be in configuration a — b — c in G; 
however, c G Tq(v) then, and so the edge between b and c would be oriented in H, a contradiction. 



Proof of Proposition [28] ": 

(i) By Corollary H2J {a G nec(v) \ a — > v G D} is a clique, hence every subset — in particular, 
C — is a clique, too. 

(ii) Assume that there is some a G C \ adc(w); then u G nec(w), otherwise u — * v — a would be 
an induced subgraph of G. Nevertheless, a G C means that u — >v< — a is a v-structure in D, 
which should hence also be present in G. 

"^=": We only must prove the existence of the claimed D G D(G), see the comment in the 
beginning of Section 14.21 We distinguish two cases: 

• u — >v € G. The existence of the DAG D G D(G) with the requested properties follows from 
Corollary H2J 

• u — v G G, hence u — a G G for all a G N because G is a chain graph. Therefore, C U {u} is 
a clique in G[nec(v)], and the existence of the claimed D again follows from Corollary 1421 

Uniqueness of £x{D'): Let D±, D2 G D(G) with u — *v G D\,Di and {a G nec(v) \ {u} | a — > v G 
D\] = {a G nec(v) \ {u} \ a — > v G D2} = C, and set D[ := — (u,v), i = 1,2. To prove 
■D x ~z £^2> we nave t° check the following three points according to Theorem HC|(iv) 

• D[ and D' 2 have the same skeleton, namely G u — (u,v) — (v,u). 

• and D' 2 have the same v-structures. Otherwise, w.l.o.g., D[ would have a v-structure 
a — > b < — c that L> 2 has not. L>i and L>2 have the same v-structures, so a — > b < — c is no 
induced subgraph of D±; this implies a = u, c = v. Since D' 2 does not have the v-structure 
u — >6< — v, the vertices u, b and v must occur in configuration u — >b — >v or u< — b — >v in D' 2 
(the configuration u< — b< — v is not consistent with the acyclicity of D2). However, all edges 
incident to v must have the same orientation in D[ and D' 2 by construction, a contradiction. 

• Let J € 2. Because of (d[ T) ) u = {D { 2 ] ) u and (D- (/) ) u = {d\ I] ) u - (u,v) - (v,u) for * = 1,2, 
we have (D'^f = (D 2 {I) ) U . U 

Corollary [29] follows quickly from Proposition l28l and the proof of Lemma [30] is very similar 
to that of Lemma [271 Therefore we skip both proofs here and proceed with the proofs of Section 

Proof of Proposition 1311 Note that we can write N = neciv) H adciu) = nec(f) D nec'(tt) 
because u — v G G and G is a chain graph. 
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(i) This follows from Corollary! 

(ii) D and D' have the same skeleton; the same is true for and D'W for all I £ I. To 
see the latter, assume that for some I € I, the intervention graphs and D'^ 1 ' have a 
different skeleton. Since D and D' only differ in the orientation of the arrow between u and 
v, the skeletons of and D'W can only differ in that u and v are adjacent in one of them 
and not adjacent in the other one. However, this would imply that \I D {u, v}\ = 1, and 



hence the edge between u and v would be directed in G by Theorem dHiv) , contradicting 



the assumption of the proposition. Finally, D 1 has at least all v-structures that D has by 
construction. 

As a consequence D 1 ^x D if and only if D' has more v-structures than D (Theorem [TOj) . 
An additional v-structure in D' must be of the form u — * v < — a. The edge between v and a 
cannot be directed in G, otherwise u — v< — a would be an induced subgraph of G, which is 



forbidden by Theorem Qilni) Hence a £ nec(v), or, more precisely, a £ C\ N. 
(iii) If N \ C is empty, the statement is trivial. Otherwise, assume that there is some shortest 
path 7 = (ao, ai, . . . , eifc) from TV \ C to C \ N in G[nec(i')] that has no vertex in C n N. 
By definition of C, at — >v £ D; furthermore, u — > ao £ D must hold, otherwise (v,ao,u,v) 
would be a directed cycle in D'. Therefore, 7 must not be a path from 00 to in D. Let 
a{ < — a.j+i be the first arrow in 7 that points away from at in D. If i = 0, u — >ao < — a\ would 
be a v-structure in D since ai ^ N: by assumption, ai ^ N n C, and a\ ^ N \ C because 

of the minimality of 7. Hence i > (and k > 1) must hold, and aj_i Oj+i in D and G, 

otherwise there would be a v-structure in D. However, 7 is not the shortest path with the 
requested properties then, a contradiction. 

From Proposition [4^1 we see that there exists a DAG D that has the requested properties, 



and in which, in addition, {a £ nec(u) \ a — > u € D} = (C n iV) U {u} (point (iv) of Proposition 



3]). The fact that D' := D — (y, u) + (u, v) D can be seen by an argument very similar to the 



proof of point (ii) above; it remains to show that D has no u-u-path except (v,u). Suppose that 
7 = (ao = v, ai, . . . , <Zjfc = u), k > 2, is such a path. In particular, 7 is then also a v-u-p&th in G, 
hence 7 lies completely in Tq{v). 

If k = 2, then ai £ iV, and so the vertices u, v and 01 occur in one of the following configurations 
in D by Proposition 1431 

v > u , v > u . 

Both configurations contradict the assumption that 7 = (y,ai,u) forms a path in D. Thus 
we conclude k > 3, and we notice a^-i G nec(u) \ {v}. If a^-i £ C, a^_i — > v £ D, hence 
(ao, ai, . . . , afc-i, cto) would be a cycle in D. On the other hand, if at-i £ C, we would have 
afc_i < — u € Z), so 7 would not be a path in D. 

Uniqueness of Sx(D'): Let D\,D2 £ D(G) with — v € D\,Di and {a G nec(f) | a — >f € Z?i} = 
{a 6 nec(w) | — >v £ D2} = C, and set D[ := Di — (v,u) + (u,v), i = 1,2; we assume that 
D'i,D 2 £ D°(G). As in the proofs of Proposition [25l we can check that D[ ~j D 2 : 

• D[ and D2 obviously have the same skeleton. 

• D\ and D2 have the same v-structures. If this does not hold for D[ and D' 2 , (w.l.o.g.) D[ 
must have a v-structure u — > v < — o that D2 ^ as n °t- Since u — f < — a cannot be an induced 
subgraph of G, a £ neciv); however, the edges between v and its neighbors are oriented in 
the same way in D[ and D' 2 by construction, a contradiction. 

• For all I £ I, D 1 ^ and D 2 have the same skeleton: this can be seen by an argument very 
similar to that in the proof of Proposition 
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Proof of Corollary 1321 We have to show pa D (v) = pa G (v) U C and pa D (u) = pa G (u) U (C D 
N) U {v}. The first identity is immediately clear. For the second identity, note that for any ver- 
tex a G C n N, the arrow between a and it must be oriented as a — > it G D because the other 
orientation would induce a 3-cycle. On the other hand, we have a < — it G D for a £ N \ C be- 
cause a different orientation would induce a 3-cycle in D' . Finally, we also have a< — u G D for any 
a G neG(u)\(neG(f)U{f }) since the other orientation would induce a v-structure v — >u< — a in D. ■ 

Lemma [33] can be proven very similarly as Lemma [271 Finally, we finish this proof section with 
the proof of Proposition [33] characterizing a step of the turning phase of GIES for the case that 
we turn an X-essential arrow in some representative D G D(G). We will omit the proof of Lemma 
1351 since it can be proven similarly to Lemma [27J 

Proof of Proposition [34] When v — '«£G (that is, it and v lie in different chain components), 
N = nec(v) fl adc(u) = nec(v) n pa G (u) holds because G is a chain graph. 



(i) This point follows from Corollary 1421 

(ii) If this was not true, D' would have a cycle of the form (u, v, a, u) for some a € N since 
N C pa G (u). 

(iii) Suppose that the path 7 = (ao = i>, ai, . . . , = u) is a shortest counterexample of a path 
without vertex in C U nee (u) . 

Assume that k = 2. Since it and v lie in different chain components, the vertices it, u and a\ 
can occur in one of the following configurations in G: 



v > it , 11 > it , u > it . 

ai ai ai 

The first case implies the existence of a directed cycle in Z)'; in the second case, 01 £ JV C C, 
in the third case, a± € nee (it)- 

Therefore k > 3. In complete analogy to the proof of Proposition [25] we can show that 7 is 
also a u-it-path in D, hence D' has a directed cycle, a contradiction. 



"<=": Let D e D(G) be a DAG with {a G ne G (i;) | a — > v G I?} = C and in which all edges of 
D[Tg(u)] point away from it; such a DAG exists by Corollary 1421 and meets the requirements of 
Proposition [3H It remains to show that D' is acyclic, that means that D has no u-it-path except 
(v,u). 

Suppose, for the sake of contradiction, that D has such a path 7 = (ao = v, ai, . . . , = u). 
7 is then also a -u-it-path in G, hence there is, by assumption, some a; £ C U P. If G C, 
(ao, ai, . . . , aj, ao) would be a cycle in D; on the other hand, if G P, (aj, ctj+i, • • ■ , ak-> a i) would 
be a cycle in D, a contradiction. 

Uniqueness of Sx(D'): The proof given for Proposition [311 is also valid here. ■ 



Proof of Corollary 1361 The fact that pa£>(i>) = p&q(v) U C is clear from Proposition 1341 it 
remains to show that pa D (u) = pa G (u). Any neighbor a of 11 must also be a child of v, oth- 



erwise G would have a subgraph of the form v — > u — a, which is forbidden by Theorem [TJ pn) 
Hence a< — 11 G -D for all a G nee (it) since the other orientation would imply a directed cycle in D' . I 
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